VALL-E is a Microsoft AI model that can imitate a person's voice from a 3-second audio sample.

 image credit: Google 

Can generate audio of that person saying arbitrary text while preserving the speaker's emotional tone

Envisioned for use in combination with other generative AI models such as GPT-3

Can be used for high-quality text-to-speech applications, speech editing, and audio content creation

Built on EnCodec, a neural audio codec revealed by Meta in October 2022; VALL-E itself is described as a neural codec language model

Generates speech as discrete audio codec codes, conditioned on a text prompt and an acoustic prompt
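
To give a rough sense of what "discrete audio codec codes" means for a three-second prompt, here is a minimal back-of-the-envelope sketch. The frame rate, number of codebooks, and codebook size below are assumptions chosen to resemble a typical 24 kHz neural codec setup; they are not figures stated in this article, and the function names are hypothetical.

```python
# Illustrative arithmetic only. All three constants are assumptions,
# not values reported in the article.
FRAME_RATE_HZ = 75     # codec frames produced per second of audio (assumed)
NUM_CODEBOOKS = 8      # discrete codes emitted per frame (assumed)
CODEBOOK_SIZE = 1024   # possible values for each code (assumed)


def prompt_token_grid(seconds):
    """Return (frames, codebooks) for an acoustic prompt of the given length."""
    frames = round(seconds * FRAME_RATE_HZ)
    return frames, NUM_CODEBOOKS


def bitrate_bps():
    """Bits per second implied by the assumed codec settings."""
    bits_per_code = (CODEBOOK_SIZE - 1).bit_length()  # 10 bits for 1024 entries
    return FRAME_RATE_HZ * NUM_CODEBOOKS * bits_per_code


frames, codebooks = prompt_token_grid(3.0)
print(frames, codebooks, frames * codebooks)  # 225 8 1800
print(bitrate_bps())                          # 6000
```

Under these assumed settings, the three-second sample becomes a grid of 225 frames by 8 codes, and the model's job is to continue that grid for the new text rather than to predict a raw waveform.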

Trained to synthesize speech on LibriLight, an audio library assembled by Meta that contains 60,000 hours of English-language recordings from more than 7,000 speakers

For good results, the voice in the three-second sample must closely resemble a voice in VALL-E's training data

Example audio samples are available on VALL-E's website for comparing accuracy

Produces more accurate results than conventional text-to-speech synthesis systems
