Google's DeepMind Teaches Computers How to Speak Human

Apple's Siri personal assistant is getting a lot smarter in the upcoming iOS 10, but odds are she'll still sound like a computer. Meanwhile, a subsidiary of Google (her creator's rival) is working on an entirely new model for teaching computers to convert text to speech.

It's called WaveNet, and Google says it can mimic any human voice while sounding more natural than text-to-speech algorithms available today.

WaveNet is based on research from DeepMind, which this week offered an in-depth look at its efforts to synthesize audio signals for more natural-sounding artificial voices. It all starts with convolutional neural networks, the same technology that powers everything from self-driving cars to disease detection.

Neural networks also now power some current text-to-speech products, including Siri, which two years ago was rebuilt to take advantage of this form of machine learning. But Siri and her colleagues, like Google Voice Search or Amazon's Alexa, still use a database of short speech fragments that are strung together to form complete words and sentences. The result is a halting, emotionless voice, even if it is understandable.

What if, instead of stitching together speech fragments, a computer could efficiently generate raw audio waveforms directly? Not only would that allow for more natural-sounding speech, but it would also let the computer mimic virtually any sound, including music. DeepMind engineers set to work.

At first, they waged an uphill battle thanks to the inherent density of raw audio, which typically packs 16,000 or more samples into every second. But the engineers were at last able to build a neural network that is trained on real waveforms from human speakers. From those recordings, the network learns a probability distribution for each audio sample given the samples that came before it, then generates new audio by drawing from that distribution one sample at a time: in essence, teaching the computer how to speak like a human.

"Building up samples one step at a time like this is computationally expensive," according to DeepMind, "but we have found it essential for generating complex, realistic-sounding audio."
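To get a feel for why one-sample-at-a-time generation is so costly, here is a minimal Python sketch of the loop. It is not DeepMind's code: the function `toy_next_sample_distribution` is a hypothetical stand-in for a trained network, and the 256 amplitude levels are an assumption made for illustration. Only the 16,000-samples-per-second figure comes from the article.

```python
import math
import random

SAMPLE_RATE = 16_000  # samples per second, the figure cited in the article
LEVELS = 256          # assumed number of discrete amplitude levels (illustrative)


def toy_next_sample_distribution(history):
    """Hypothetical stand-in for a trained network.

    Returns one probability per amplitude level, peaked around the
    previous sample so the toy "waveform" changes smoothly.
    """
    center = history[-1] if history else LEVELS // 2
    spread = 4.0  # arbitrary width of the bell curve
    weights = [
        math.exp(-((level - center) ** 2) / (2 * spread ** 2))
        for level in range(LEVELS)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def generate(num_samples, seed=0):
    """Build a clip one sample at a time, each conditioned on the past."""
    rng = random.Random(seed)
    audio = []
    for _ in range(num_samples):
        probs = toy_next_sample_distribution(audio)
        next_sample = rng.choices(range(LEVELS), weights=probs, k=1)[0]
        audio.append(next_sample)
    return audio


# 160 samples is only 10 milliseconds of sound at 16,000 samples per second.
clip = generate(160)
print(len(clip), "samples =", 1000 * len(clip) / SAMPLE_RATE, "ms")
```

Each pass through the loop needs the output of the pass before it, so a single second of speech means 16,000 sequential predictions. In this toy that is cheap; with a large neural network doing each prediction, it adds up quickly.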

The result is remarkable. DeepMind provided samples of its speech capabilities alongside those typically used today, and the difference in inflection, tone, and emotion is immediately apparent. Have a listen for yourself.

It's only natural that computers' speech synthesis will become more, well, natural: Google and its competitors have invested significant resources in developing personal assistants. In order for them to catch on, humans need to think of them less as a gimmick and more as articulate, pleasant robots.

About Our Expert

Tom Brant

Managing Editor

I’m a managing editor at PCMag.com focused on PC hardware. Reading this during the day? Then you've caught me testing gear and editing reviews of Wi-Fi routers, printers, laptops, and tons of other personal tech. (Reading this at night? Then I’m probably dreaming about all those cool products.) I’ve covered the consumer tech world as an editor, reporter, and analyst since 2015.

I've covered most major consumer tech events, including CES, Computex, Google I/O, and IFA. I've also appeared on CBS News, in USA Today, and at many other outlets to offer analysis on breaking technology news.

Before I joined the tech-journalism ranks, I wrote on topics as diverse as Borneo's rainforests, Middle Eastern airlines, and Big Data's role in presidential elections. A graduate of Middlebury College, I also have a master's degree in journalism and French Studies from New York University.

The Technology I Use

While most people buy a phone or laptop and stick with it for years, I’m lucky enough to use devices based on Android, iOS, macOS, and Windows daily as part of my job. As a result, I cycle through lots of tech in addition to my IT-issue work laptop. (Yes, that's a ThinkPad.) Personally, I’ve also owned a lot of tech products both cutting-edge and cringeworthy, from the Nintendo GameCube and the original MacBook to the Palm m105 and the CueCat.

Soon, you might have a hard time telling the difference between human and computer voices.