VALL-E is an AI model from Microsoft that can imitate a person's voice from a 3-second audio sample.
image credit: Google
Capable of generating audio of a person speaking any words while preserving their emotional tone
Envisioned to be used in combination with other generative AI models like GPT-3
Can be used for high-quality text-to-speech applications, speech editing, and audio content creation
Built on EnCodec, a neural audio codec revealed by Meta in October 2022; Microsoft describes VALL-E itself as a neural codec language model
Generates discrete audio codec codes from text and acoustic prompts, then decodes those codes into speech
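To make "discrete audio codec codes" concrete, here is a toy sketch of residual vector quantization, the kind of scheme a neural audio codec can use to turn a frame of audio features into a few small integers. This is not VALL-E's or EnCodec's code: the codebooks, the two-number "frame", and the function names are all made up for illustration, and real codebooks are learned from audio rather than written by hand.

```python
# Toy residual vector quantization (RVQ). Each stage picks the codebook entry
# nearest to whatever the previous stages left unexplained.

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(frame, codebooks):
    """Turn one feature frame into a list of discrete codes, one per stage."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next stage encodes what remains.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Rebuild an approximate frame by summing the chosen entries."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical hand-made codebooks (assumption: real ones are learned).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine stage
]

frame = [1.2, 0.9]                      # made-up two-number "audio frame"
codes = rvq_encode(frame, CODEBOOKS)    # [1, 1]
approx = rvq_decode(codes, CODEBOOKS)   # [1.25, 1.0]
print(codes, approx)
```

The frame is reduced to two small integers, and decoding them gives back a close approximation. A model like VALL-E predicts codes of this kind from the text and the voice sample, and a codec decoder turns them back into sound.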
Trained to synthesize speech on LibriLight, an audio library assembled by Meta that contains 60,000 hours of English-language recordings from more than 7,000 speakers
The three-second audio sample must closely resemble a voice in VALL-E's training data for the output to be convincing
Example audio samples on VALL-E's website let listeners compare its accuracy for themselves
Produces more accurate results than conventional text-to-speech synthesis systems