Microsoft researchers have unveiled a new AI model, VALL-E, which can convincingly imitate a person's voice from just a three-second audio sample. Once it has learned a voice, VALL-E can generate audio of that person speaking any words while preserving their emotional tone. Its creators envision VALL-E being combined with other generative AI models, such as GPT-3, to build high-quality text-to-speech applications, speech editing in which a recording of a person is altered from a text transcript, and audio content creation tools.
Microsoft describes VALL-E as a “neural codec language model” built on EnCodec, an audio codec Meta revealed in October 2022. Unlike conventional text-to-speech techniques, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. In other words, it takes how a person sounds, breaks that down into discrete elements (known as “tokens”) using EnCodec, and then draws on its training data to predict what that voice would sound like speaking phrases beyond the three-second sample.
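The tokenization idea above can be illustrated with a toy residual vector quantizer, the general mechanism EnCodec-style neural codecs use to turn continuous audio features into discrete tokens. This is a minimal sketch, not Microsoft's or Meta's actual code: the codebooks here are random, the frame dimension and codebook sizes are made-up, and a real codec learns its codebooks from data.

```python
import numpy as np

# Toy residual vector quantization (RVQ), the scheme EnCodec-style codecs
# use: each latent audio frame becomes a short list of discrete codebook
# indices ("tokens") that a language model like VALL-E can predict.
# Codebooks are random here, purely for illustration.

rng = np.random.default_rng(0)

FRAME_DIM = 8       # dimensionality of one latent audio frame (assumed)
CODEBOOK_SIZE = 16  # entries per codebook (assumed)
N_CODEBOOKS = 4     # RVQ stages: each stage quantizes the previous residual

codebooks = [rng.normal(size=(CODEBOOK_SIZE, FRAME_DIM))
             for _ in range(N_CODEBOOKS)]

def encode(frame):
    """Quantize one latent frame into N_CODEBOOKS discrete token ids."""
    tokens, residual = [], frame
    for cb in codebooks:
        # Pick the codebook entry nearest to the current residual.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual = residual - cb[idx]  # next stage refines what is left
    return tokens

def decode(tokens):
    """Approximately reconstruct a frame by summing the chosen entries."""
    return sum(cb[i] for cb, i in zip(codebooks, tokens))

frame = rng.normal(size=FRAME_DIM)   # stand-in for an encoded audio frame
tokens = encode(frame)               # a few small integers per frame
approx = decode(tokens)              # lossy reconstruction of the frame
```

A real system runs this per frame over the whole clip, so three seconds of audio become a short sequence of integer tokens; VALL-E's language model then continues that token sequence for new text, and the codec's decoder turns the predicted tokens back into a waveform.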
Microsoft trained VALL-E to synthesize speech using Meta's LibriLight library, which contains 60,000 hours of English-language recordings from 7,000 speakers, mostly drawn from LibriVox public domain audiobooks. For good results, the voice in the three-second sample must closely match a voice represented in VALL-E's training data.
On the VALL-E example website, the American technology giant provides a variety of audio samples of the AI model in action. The “Speaker Prompt” is the three-second audio clip given to VALL-E to imitate. For comparison, the “Ground Truth” sample is a pre-recorded clip of the same speaker saying the same phrase. The “Baseline” sample is generated by a conventional text-to-speech synthesis system, while the “VALL-E” sample is generated by the VALL-E model itself.