DeepMind produces ‘significantly better’ text-to-speech with WaveNet


Google’s artificial intelligence business, DeepMind, has made a big leap in computer-generated speech. Its WaveNet text-to-speech engine can pronounce English and Mandarin in a way that sounds almost human.

WaveNet is a self-learning neural network that produces raw sound waves. It learned its ‘speech’ by training on audio recorded at tens of thousands of samples per second. A single WaveNet can capture the natural way of speaking of several different speakers and switch between them. In addition to training on speech, the researchers also had WaveNet analyze music fragments, after which it could create new and realistic-sounding piano pieces. The model also recognizes differences between phonemes, i.e. the smallest sound units that distinguish one meaning from another.

Example of the structure of one second of speech: around 16,000 samples. Source: DeepMind

The researchers achieved these results by training WaveNet directly on raw sound waves. This is an approach that is usually avoided, the researchers write on their blog, because audio contains about 16,000 samples per second or more. To get good-sounding speech, each sample must be appropriately influenced by all of the preceding samples. Because the samples are generated one at a time, each building on the ones before it, a lot of computing power is required. The chance that this technology will soon end up in consumer products is therefore small.
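The step-by-step generation described above can be sketched in a few lines of Python. This is only a toy illustration, not DeepMind's model: the function `toy_next_sample` is a hypothetical stand-in that uses a fixed two-term formula where WaveNet uses a trained neural network, and the 440 Hz tone and the seed values are arbitrary assumptions. What it does show is the structure of the loop: every new sample is computed from earlier output, so the samples cannot be produced in parallel.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the figure quoted in the article


def toy_next_sample(history, freq_hz=440.0):
    """Hypothetical stand-in for the trained network.

    Predicts the next sample from the two most recent ones with a fixed
    linear recurrence that traces out a sine wave. WaveNet instead uses a
    learned neural network that looks much further back in time.
    """
    w = 2.0 * math.pi * freq_hz / SAMPLE_RATE
    return 2.0 * math.cos(w) * history[-1] - history[-2]


def generate(seconds, seed=(0.0, 0.01)):
    """Generate audio one sample at a time; each step needs the previous output."""
    samples = list(seed)  # arbitrary starting values (an assumption)
    total = int(seconds * SAMPLE_RATE)
    while len(samples) < total:
        samples.append(toy_next_sample(samples))
    return samples


audio = generate(seconds=1.0)
print(len(audio))  # 16000 samples for a single second of sound
```

Even in this toy version, one second of audio takes thousands of sequential steps. With a large neural network evaluated at every step, it becomes clear why generation is computationally expensive.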

How WaveNet uses the preceding samples to generate the next one. Source: DeepMind

To find out how WaveNet’s pronunciation compares to other text-to-speech systems and to text spoken by real people, the researchers presented test samples to a panel. On a scale of 1 to 5, WaveNet scored 4.21 in American English and 4.08 in Mandarin. Real people scored 4.55 in English and 4.21 in Mandarin. The researchers also had WaveNet come up with a kind of ‘language’ of its own. It sounds like human language, but it is not one.
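To put the panel scores in perspective, the short Python snippet below computes how far WaveNet remains behind human speech in each language. It uses only the four figures quoted above; the dictionary layout and the name `gap_to_human` are illustrative choices, not something taken from the research.

```python
# Panel scores on a scale of 1 to 5, as reported in the article.
scores = {
    "American English": {"wavenet": 4.21, "human": 4.55},
    "Mandarin": {"wavenet": 4.08, "human": 4.21},
}


def gap_to_human(entry):
    """Difference between the human score and the WaveNet score."""
    return round(entry["human"] - entry["wavenet"], 2)


gaps = {language: gap_to_human(entry) for language, entry in scores.items()}
print(gaps)  # {'American English': 0.34, 'Mandarin': 0.13}
```

By this measure WaveNet comes closer to human speech in Mandarin than in American English.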

Several samples are available on the WaveNet website, and the researchers have also published a research article. DeepMind, the business unit from which WaveNet originates, is Google’s artificial intelligence arm. It has already achieved several successes, including playing the game of Go and analyzing eye scans.

Audio samples: parametric speech synthesizer, concatenative speech synthesizer, WaveNet.
