The Power Of Voice: A Conversation With The Head Of Google's Speech Technology

For all the whiz-bang graphics and nifty apps appearing on smartphones these days, there are still few things that feel more futuristic than pulling out your phone, uttering the words, “find directions to the Exploratorium”, and having Google immediately do your bidding. The technology is becoming widely available via apps on the iPhone and deep integration into Android, and this is really only the beginning.

Earlier this month I had the chance to sit down with Mike Cohen, the man who leads all of Google’s speech technology efforts, to get a look behind the curtain at why Google has invested so much into voice, and where things are going from here.

A Look Back

Before we discuss where we stand now, it’s worth looking at Cohen’s past, which also serves as a good history lesson on speech technology. Cohen has been at Google since 2004, but he’s been straddling the intersection of voice and technology for decades, getting his start at the Stanford Research Institute in the early 1980s.

Cohen says that in the 1970s there were two main camps working on speech: linguists and engineers. The linguists were all about rules — they’d identify various trends in grammar and pronunciation and how each phoneme interacted with the others. The engineers were taking a different approach: rather than trying to painstakingly identify each rule manually, they set out to build complex statistical models that improved as more speech data was fed into them.

By the late 70s and early 80s, when Cohen started doing research at SRI, the engineers were in the lead. But there was a problem: the improvements seen in their models were starting to asymptote. Cohen explains that because the models themselves weren’t changing, feeding them more data was eventually going to provide diminishing returns (for example, their models were bad at recognizing how pronunciation depends not only on which words are being said, but also on their context). The engineers needed to find a way to build richer models — so they finally began to collaborate with the linguists. And a research boom ensued.

By the early 90s speech technology had gotten sufficiently advanced that researchers could create the DARPA-funded Air Travel Information System (ATIS) — where a user could walk up to a terminal, say, “Show me the flights from Boston”, and the computer would spit back the relevant data. The system could understand countless variations on such commands (you didn’t have to memorize certain keywords) — pretty amazing given the fact that this system was built around the time Windows 95 came out.

Based on the success of the ATIS, Cohen decided that the technology was ready for commercialization, so he and three cofounders left to start Nuance. The company focused on building automated enterprise call systems, which it then sold to major businesses that had to deal with high inbound call volume — things like an automated stock quote system for Charles Schwab, and customer service for phone companies.

Given his history as a researcher, it isn’t surprising that Cohen was looking at ways to improve Nuance’s speech recognition software. And, as it turned out, the huge volume of call recordings coming in was even more useful than the data he’d had access to while a researcher at SRI. He explains that there are things that can’t be reproduced in a lab environment — a dog barking in the background, a child crying, and so on — that were present in these inbound phone calls, exposing Nuance to important new challenges in speech analysis.

But there was one big problem: despite the fact that its technology was dealing with a huge volume of data, Nuance would have to approach each of its enterprise customers and ask for access to this data for research purposes. Enterprises stood to gain because they’d reap any improvements in the technology, but some of them were wary anyway. Which set the stage for Cohen to finally make the jump to Google.

GOOG-411 and Beyond

In 2004 Google’s voice efforts were basically non-existent. But Cohen saw an opportunity: even then it was clear that mobile was going to have a big impact on the future of technology. And because Google faces the end-user directly, any incoming voice data would be immediately accessible for research purposes. So he made the switch to the search giant, and began what became Google’s free 411 voice service, GOOG-411.

The service launched in 2007, offering a straightforward and handy feature set: you’d call in, ask for some basic information like a business’s phone number, and it would immediately give you that information free of charge. Cohen says the main motivation for launching GOOG-411 was the fact that it’s useful, but it had an important secondary function: it allowed Google to begin building up a massive corpus of voice data. Remember the data models discussed earlier? Google’s speech systems use similar concepts, but at a much larger scale.

GOOG-411 was killed off in October, but Google now has more sources of voice data, including the microphone button seen throughout Android and the Google Mobile application for iPhone. And Google can look at text-based search queries to identify which terms most often follow one another. All of which means Google can train its language models relatively quickly.
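To make the idea of learning word order from text queries concrete, here is a minimal sketch of a bigram counter in Python. It is only an illustration of the general technique, not Google’s system: the function names and the four sample queries are made up, and a real language model would add smoothing and work at a vastly larger scale.

```python
from collections import Counter, defaultdict


def train_bigrams(queries):
    """Count how often each word is followed by each other word."""
    follows = defaultdict(Counter)
    for query in queries:
        words = query.lower().split()
        for prev, nxt in zip(words, words[1:]):
            follows[prev][nxt] += 1
    return follows


def next_word_probability(follows, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (no smoothing)."""
    counts = follows.get(prev, Counter())
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0


# Hypothetical stand-in for a query log; these strings are invented.
queries = [
    "directions to the exploratorium",
    "directions to the airport",
    "directions to work",
    "pictures of the exploratorium",
]

model = train_bigrams(queries)
print(next_word_probability(model, "to", "the"))
```

In this toy data, “the” follows “to” in two of the three queries containing “to”, so the estimate is two thirds. A recognizer could use numbers like this to prefer a likely word sequence over an acoustically similar but improbable one.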

Cohen says that these days Google uses 230 billion search queries to train the language model used by Google’s speech recognizer. To give an idea of how large that volume of data is, he says the training would take 70 years to be completed on a single CPU (though Google obviously has far greater resources).
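As a back-of-the-envelope check on those two figures, the short Python sketch below works out what they imply. The 230 billion queries and 70 CPU-years come from Cohen; the 1,000-CPU cluster and the assumption of perfect parallel scaling are hypothetical, chosen only to show how the arithmetic goes.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

TOTAL_QUERIES = 230e9   # figure quoted by Cohen
SINGLE_CPU_YEARS = 70   # figure quoted by Cohen


def queries_per_cpu_second(total_queries, cpu_years):
    """Average number of queries one CPU would have to process per second."""
    return total_queries / (cpu_years * SECONDS_PER_YEAR)


def wall_clock_days(cpu_years, num_cpus):
    """Elapsed days if the work split perfectly across num_cpus (an idealization)."""
    return cpu_years * 365.25 / num_cpus


rate = queries_per_cpu_second(TOTAL_QUERIES, SINGLE_CPU_YEARS)
days = wall_clock_days(SINGLE_CPU_YEARS, num_cpus=1000)  # 1000 is a made-up cluster size

print(f"{rate:.0f} queries per CPU-second")
print(f"{days:.1f} days on 1,000 CPUs")
```

Under these assumptions the figures work out to roughly 104 queries per CPU-second, and about 25.6 days of wall-clock time on the hypothetical 1,000-CPU cluster.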

The technology is now used across a variety of products. YouTube automatically captions millions of videos. Google Voice attempts to transcribe inbound voice messages (with some pretty hilarious results). And voice search is going to play a much bigger role on mobile devices — don’t be surprised if we start seeing cars with media centers running Android in the not-so-distant future. You can bet they’ll be voice-enabled.

Cohen was happy to talk in broad terms about Google’s voice efforts, but he was opaque when it came to sharing stats, upcoming features, and predictions. He wouldn’t discuss the kind of voice search volume that Google sees, though he did acknowledge that it fluctuates widely depending on whether a new voice-enabled feature has launched and whether there has been recent coverage in the press.

When I asked him how long it would be before voice search became accurate to the point where we took it for granted (and didn’t have to check for typos), he declined to offer a projection (he noted that he could say something like “five years”, but that that’s just research terminology for “I have no idea”).

I also asked him what he thought about Apple’s voice efforts — the company acquired Siri last year, and it seems obvious that it’s going to begin incorporating voice into iOS. Again, Cohen didn’t have much to say here (though this wasn’t really surprising). He did say that Google has the natural advantage of having already released a product that gives it a massive volume of data, but ultimately it will come down to what Apple builds and who they partner with.

But while he wouldn’t get into specifics, Cohen did share Google’s long-term vision for this technology: it wants speech input to be completely ubiquitous. “We don’t ever want there to be a scenario where speech would be valuable, if only it had been available — just like you can enter text with a keyboard anywhere, you should be able to do it with speech.” And accuracy is a big part of that: “It needs to work so close to perfect that the choice isn’t based on performance, but on end-user preference.”