VALL-E is an AI model that can imitate a person's voice with a 3-second audio sample. image credit: Google
Capable of generating audio of a person speaking any words while preserving their emotional tone
Envisioned to be used in combination with other generative AI models like GPT-3
Can be used for high-quality text-to-speech applications, speech editing, and audio content creation
Based on EnCodec, a neural audio codec revealed by Meta in October 2022
Generates discrete audio codec codes from text and acoustic prompts, which are then decoded into speech
Trained to synthesize speech on Meta's LibriLight library, which contains 60,000 hours of English-language recordings from 7,000 speakers
The three-second audio sample must resemble a voice in VALL-E's training data for the output to be convincing
Example audio samples on VALL-E's website let listeners compare its accuracy
Produces more accurate results than conventional text-to-speech synthesis systems
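
For readers wondering what "discrete audio codec codes" means in practice, here is a rough toy sketch. It is not VALL-E's or EnCodec's actual code: it assumes a simple uniform quantizer in place of a learned neural codec, and the function names `encode` and `decode` are hypothetical. The idea it illustrates is only that audio can be turned into a short sequence of integers and then turned back into approximate sound.

```python
# Toy illustration only: a uniform scalar quantizer standing in for a neural codec.
# Real codecs such as EnCodec learn their codebooks; this is an assumed simplification.

def encode(samples, levels=8):
    """Map each sample in [-1.0, 1.0] to an integer code in [0, levels - 1]."""
    codes = []
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # keep the sample inside the valid range
        codes.append(round((clipped + 1.0) / 2.0 * (levels - 1)))
    return codes


def decode(codes, levels=8):
    """Map integer codes back to approximate sample values in [-1.0, 1.0]."""
    return [c / (levels - 1) * 2.0 - 1.0 for c in codes]


waveform = [-1.0, -0.4, 0.1, 0.6, 1.0]  # made-up samples, not real audio
codes = encode(waveform)                # a list of small integers
reconstructed = decode(codes)           # close to, but not equal to, the input

print(codes)
print(reconstructed)
```

A language model like the one described above would predict sequences of such integer codes rather than raw audio, leaving a separate decoder to turn them back into a waveform.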