Microsoft’s AI model Vall-E can imitate a voice from a three-second sample
Microsoft is showing off a new AI model called Vall-E. According to the tech giant, this text-to-speech model can generate spoken sentences in almost any voice after hearing a three-second sample. The AI model can also mimic intonation and emotion.
Vall-E uses a language model and was trained on 60,000 hours of English voice recordings, the researchers write in their paper. According to the makers, the tool can imitate a voice after hearing a sample of just three seconds. It can then produce audio clips in that voice from a written prompt.
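Vall-E itself is not publicly available, so the Python snippet below is only a conceptual sketch of the interface the paper describes: a written prompt plus a roughly three-second voice sample in, synthetic speech out. The function name, sample rate, duration heuristic, and placeholder output are all invented for illustration and are not Microsoft’s actual API.

```python
import numpy as np

SAMPLE_RATE = 24_000  # assumed; the real model's rate may differ

def clone_voice_tts(text: str, voice_sample: np.ndarray) -> np.ndarray:
    """Conceptual stand-in for a Vall-E-style call: a text prompt plus a
    ~3-second voice sample in, synthetic speech in that voice out."""
    assert len(voice_sample) >= 3 * SAMPLE_RATE, "needs about 3 s of audio"
    # Placeholder output: real synthesis would happen here. One second
    # of audio per ~12 characters is just a crude duration heuristic.
    duration = max(1, len(text) // 12)
    return np.zeros(duration * SAMPLE_RATE, dtype=np.float32)

sample = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)  # stand-in recording
speech = clone_voice_tts("Read this sentence in the sampled voice.", sample)
print(f"generated {len(speech) / SAMPLE_RATE:.1f} s of audio")
```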
The Vall-E paper was published via arXiv, the preprint server run by Cornell University, and the researchers put up a website with several demos. On that page, several real voice recordings can be heard that were used as samples for Vall-E; each sample is paired with one or more synthetic voice recordings generated by the model. Their quality varies: some recordings sound convincing, while others clearly betray that they were generated by software.
The researchers state that Vall-E outperforms current text-to-speech models in many cases. However, the paper also notes that the model still has several problems. For example, certain words from the text prompt may come out slurred, be missed entirely, or be duplicated in the output. The model also currently struggles to imitate certain voices, especially voices with an accent.
Such AI models are also controversial, as they can be used to imitate someone’s voice without permission. The researchers acknowledge in their paper that the model could be abused in this way. They suggest it would be possible to build a detection model that can recognize whether an audio fragment was generated by Vall-E.
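As an illustration of what such a detector could look like, here is a minimal sketch of a binary real-versus-synthetic audio classifier. The features and data are random placeholders, and logistic regression is just one simple choice; the researchers’ actual detection approach is not described in the paper coverage above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder features: in practice these would be spectral or
# codec-token statistics extracted from real and Vall-E audio clips.
real_features = rng.normal(0.0, 1.0, size=(500, 32))
fake_features = rng.normal(0.3, 1.0, size=(500, 32))  # assumed slight shift

X = np.vstack([real_features, fake_features])
y = np.array([0] * 500 + [1] * 500)  # 0 = real speech, 1 = synthetic

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2f}")
```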
At this time, Vall-E is not yet publicly available. Microsoft has put a Vall-E repository on GitHub, but it currently contains little more than a limited readme file. The tech giant has not said whether or when the text-to-speech model will become widely available.
Image: Vall-E’s pipeline. A text prompt and a three-second audio sample are fed to a language model, which then generates synthetic speech recordings.
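According to the paper, Vall-E treats speech synthesis as a language-modeling task over discrete audio codec tokens (it builds on the EnCodec codec). The sketch below mimics that token-level flow with stand-in functions; the codebook size, frame rate, and tokens-per-phoneme budget are assumed values for illustration, not the model’s real settings.

```python
import numpy as np

rng = np.random.default_rng(1)

CODEBOOK_SIZE = 1024   # assumed number of discrete acoustic codes
FRAMES_PER_SEC = 75    # assumed codec frame rate

def encode_prompt(seconds: float) -> np.ndarray:
    """Stand-in for the codec encoder: audio -> discrete token ids."""
    n_frames = int(seconds * FRAMES_PER_SEC)
    return rng.integers(0, CODEBOOK_SIZE, size=n_frames)

def language_model_step(context: np.ndarray) -> int:
    """Stand-in for one autoregressive step of the acoustic language
    model; the real model also conditions on the phonemized text."""
    return int(rng.integers(0, CODEBOOK_SIZE))

# The 3-second voice sample becomes a token prefix conditioning generation.
prompt_tokens = encode_prompt(3.0)
phonemes = list("hello world")  # stand-in for the phonemized text prompt

generated = list(prompt_tokens)
for _ in range(len(phonemes) * 10):  # rough assumed tokens-per-phoneme budget
    generated.append(language_model_step(np.array(generated)))

# A codec decoder (not shown) would turn `generated` back into a waveform.
print(f"{len(prompt_tokens)} prompt tokens -> {len(generated)} total tokens")
```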