Google's WaveNet machine learning-based speech synthesis comes to Assistant

Last year, Google showed off WaveNet, a new way of generating speech that didn’t rely on a bulky library of word bits or cheap shortcuts that result in stilted speech. WaveNet used machine learning to build a voice sample by sample, and the results were, as I put it then, “eerily convincing.” Previously bound to the lab, the tech has now been deployed in the latest version of Google Assistant.

The general idea behind the tech was to recreate words and sentences not by coding grammatical and tonal rules manually, but by letting a machine learning system find those patterns in speech and generate them sample by sample. A sample, in this case, is the tone generated every 1/16,000th of a second.
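To make "sample by sample" concrete, here is a toy sketch of that kind of generation loop. It is emphatically not WaveNet: the `next_sample` function is a hypothetical stand-in for the learned model, and it simply uses a fixed two-term recurrence that happens to produce a pure 440 Hz tone. The only thing it shares with the real system is the shape of the loop, in which each new sample is computed from the ones that came before it.

```python
import math

SAMPLE_RATE = 16_000  # samples per second, the rate quoted for the original WaveNet


def next_sample(history, freq_hz=440.0):
    """Toy stand-in for a learned model: predict the next sample from the previous two."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE
    # Identity: sin((n+1)w) = 2*cos(w)*sin(n*w) - sin((n-1)w)
    return 2 * math.cos(w) * history[-1] - history[-2]


def generate(duration_s, freq_hz=440.0):
    """Build a clip one sample at a time, feeding each output back in as input."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE
    samples = [0.0, math.sin(w)]  # two seed samples to start the recurrence
    total = round(duration_s * SAMPLE_RATE)
    while len(samples) < total:
        samples.append(next_sample(samples, freq_hz))
    return samples[:total]


clip = generate(1.0)
print(len(clip))  # 16000 samples for one second of audio
```

One second of audio means 16,000 trips through that loop, which hints at why a far heavier model in place of `next_sample` would be slow.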

At the time of its first release, WaveNet was extremely computationally expensive, taking a full second to generate 0.02 seconds of sound, so a two-second clip like “turn right at Cedar street” would take 100 seconds, or nearly two minutes, to generate. As such, it was poorly suited to actual use (you’d have missed your turn by then), which is why Google engineers set about improving it.
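The arithmetic behind that estimate is easy to check. A minimal sketch, assuming only the figures quoted above (0.02 seconds of audio per second of compute, 16,000 samples per second); the function and variable names are purely illustrative:

```python
SAMPLE_RATE = 16_000          # samples per second of audio (original WaveNet)
AUDIO_PER_COMPUTE_SEC = 0.02  # seconds of audio produced per second of compute


def generation_time(clip_seconds, audio_per_compute_sec=AUDIO_PER_COMPUTE_SEC):
    """Seconds of compute needed to synthesize a clip of the given length."""
    return clip_seconds / audio_per_compute_sec


clip_seconds = 2.0
total = generation_time(clip_seconds)
# Average compute time spent on each individual sample, in milliseconds.
per_sample_ms = 1000 * total / (clip_seconds * SAMPLE_RATE)

print(f"{total:.0f} s to generate, {per_sample_ms:.3f} ms per sample")
# 100 s to generate, 3.125 ms per sample
```

A hundred seconds is one minute and 40 seconds: long enough to sail past Cedar street several times over.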

The new, improved WaveNet generates sound at 20x real time, producing the same two-second clip in a tenth of a second. And it even creates sound at a higher sample rate: 24,000 samples per second, up from 16,000, and at 16 bits per sample rather than 8. Not that high-fidelity sound can really be appreciated on a smartphone speaker, but given today’s announcements, we can expect Assistant to appear in many more places soon.
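For a sense of scale, here is a rough back-of-the-envelope comparison. It assumes the original ran at 16,000 samples per second and 8 bits per sample and produced 0.02 seconds of audio per second of compute, as the figures in this article imply; the helper name is made up for illustration.

```python
def bits_per_second(sample_rate, bit_depth):
    """Raw (uncompressed) audio data rate in bits per second."""
    return sample_rate * bit_depth


old_rate = bits_per_second(16_000, 8)   # original: 128,000 bits per second
new_rate = bits_per_second(24_000, 16)  # improved: 384,000 bits per second

clip_seconds = 2.0
new_time = clip_seconds / 20.0  # 20x real time
speedup = 20.0 / 0.02           # new real-time factor over the old one

print(f"{new_time:.1f} s, {speedup:.0f}x faster, {new_rate / old_rate:.0f}x the raw data")
# 0.1 s, 1000x faster, 3x the raw data
```

In other words, the new version is roughly a thousand times faster while producing three times as much raw audio data per second of speech.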

The voices generated by WaveNet sound considerably better than the state-of-the-art concatenative systems used previously:

Old and busted (concatenative): [audio sample]

New and hot (WaveNet): [audio sample]

(More samples are available at the DeepMind blog post, though presumably the Assistant will also sound like this soon.)

WaveNet also has the admirable quality of being extremely easy to scale to other languages and accents. If you want it to speak with a Welsh accent, there’s no need to go in and fiddle with the vowel sounds yourself. Just give it a couple dozen hours of a Welsh person speaking and it’ll pick up the nuances itself. That said, the new voice is only available for U.S. English and Japanese right now, with no word on other languages yet.

In keeping with the trend of “big tech companies doing what the other big tech companies are doing,” Apple, too, recently revamped its assistant (Siri, don’t you know) with a machine learning-powered speech model. That one’s different, though: it didn’t go so deep into the sound as to recreate it at the sample level, but stopped at the (still quite low) level of half-phones, or fractions of a phoneme.

The team behind WaveNet plans to publish its work soon, but for now you’ll have to be satisfied with their promises that it works and performs much better than before.