Microsoft Unveils VALL-E, a New Text-to-Speech AI Model

11 Jan 2023 11:44 IST

Microsoft researchers have unveiled VALL-E, a new text-to-speech AI model that can closely imitate a person's voice from just a three-second audio sample. Once it has learned a voice, VALL-E can generate audio of that person saying any words while attempting to preserve the speaker's emotional tone. Its creators envision it being combined with other generative AI models, such as GPT-3, for high-quality text-to-speech applications, for speech editing in which a recording of a person is altered by editing its text transcript, and for audio content creation.

Microsoft describes VALL-E as a "neural codec language model" that builds on EnCodec, an audio codec technology Meta announced in October 2022. Unlike other text-to-speech techniques, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. In essence, it analyzes how a person sounds, uses EnCodec to break that information into discrete components (known as "tokens"), and then draws on its training data to match what it "knows" about how that voice would sound speaking phrases other than the three-second sample.
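The idea of turning audio into discrete tokens can be illustrated with a toy sketch. This is not EnCodec or VALL-E: EnCodec uses a learned neural encoder and learned codebooks, whereas the sample rate, frame size, and hand-built codebook below are all hypothetical values chosen only to show how a waveform becomes a sequence of integer codes and back.

```python
import math

# Hypothetical toy settings; not the values used by EnCodec or VALL-E.
SAMPLE_RATE = 8000   # samples per second
FRAME_SIZE = 4       # samples grouped into one token

# Hand-built codebook: each entry is a constant frame at one amplitude level.
# A real neural codec learns its codebook entries from data.
CODEBOOK = [[level] * FRAME_SIZE for level in (-1.0, -0.5, 0.0, 0.5, 1.0)]


def sine(seconds, freq):
    """Generate a sine wave standing in for a short voice prompt."""
    n = int(seconds * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def frames(samples):
    """Split samples into fixed-size frames, dropping any incomplete tail."""
    usable = len(samples) - len(samples) % FRAME_SIZE
    return [samples[i:i + FRAME_SIZE] for i in range(0, usable, FRAME_SIZE)]


def nearest(frame, codebook):
    """Return the index of the codebook entry closest to the frame."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def encode(samples, codebook):
    """Map a waveform to a list of discrete tokens (codebook indices)."""
    return [nearest(f, codebook) for f in frames(samples)]


def decode(tokens, codebook):
    """Rebuild an approximate waveform by concatenating codebook entries."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out


prompt = sine(3.0, 220.0)          # a three-second "acoustic prompt"
tokens = encode(prompt, CODEBOOK)  # the discrete codes a language model would see
print(len(prompt), "samples ->", len(tokens), "tokens")
```

A model like the one described would operate on token sequences of this kind rather than on raw waveform samples; the decoding step then turns predicted tokens back into audio.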

Microsoft trained VALL-E's speech-synthesis capabilities on LibriLight, an audio library assembled by Meta. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly drawn from LibriVox public-domain audiobooks. For VALL-E to produce a good result, the voice in the three-second sample must closely resemble a voice in the training data.
