Will We Ever Bury Voice Recognition?

Over on our sister blog TechCrunch, we just reviewed a new Windows-based IM client called Say2Go. The client itself is simple; its main selling point is that you can record a message, have it transcribed automatically and then send it as a text message. Microsoft thought that this use of their new speech technology was so great that they awarded the company first prize in their ISV/Partner application contest.

Microsoft, and particularly Bill Gates, has been a strong advocate of voice recognition technology. Gates said in 1997:

“In this 10-year time frame, I believe that we’ll not only be using the keyboard and the mouse to interact, but during that time we will have perfected speech recognition and speech output well enough that those will become a standard part of the interface.”

Voice recognition is a constant theme in his forward-looking technology books, speeches and interviews – but along with other Gates-backed technologies such as the wallet PC, the PDA watch and an operating system using real-world objects as an interface (Microsoft Bob), voice recognition has so far fallen flat and way below earlier expectations.

In using the new Microsoft library as part of Say2Go, the problems are immediate and obvious. To start with, the initial training takes 10 minutes, but moving anywhere beyond 80% accuracy requires a lot more user input, learning and training (80% is Microsoft’s claim; somewhat ironically, the Gates book ‘The Road Ahead’ is used as the training text).

The real-world level of accuracy is much lower. With my accent, even speaking clearly and slowly and after 30 or more minutes of training, it couldn’t string together more than three or four words correctly. Whole sentences lost all meaning, and it seems pointless to speak at 30 words per minute when the keyboard in front of you can handle 60, 90 or over 100 words per minute (particularly in instant messaging, where a whole new English lexicon of shortened words and phrases has formed).

In a real-world enterprise environment, it is impossible to imagine a room full of people all using voice dictation at their computers. The background noise is difficult to filter out, and the modern office is full of interruptions: phones ringing, instant messages, new emails and more. When typing at a keyboard, you can easily multi-task, stopping and starting as you switch between programs. With voice recognition, you need to pause or stop recording, and you have to press a button to tell the application when you are actually speaking to it.

Portable devices previously did not have the luxury of a full QWERTY keyboard, but recent interface advances such as multi-touch, the virtual keyboard on the iPhone and predictive input technology from Nokia have brought input accuracy and speed up to almost-QWERTY levels. These technologies make voice recognition look like an unnecessary, once-futuristic technology born out of sci-fi movies, and that is probably where voice recognition should stay.

It is 2008, eleven years after the Bill Gates “10-year time frame” quote, and Microsoft is still pushing voice. Say2Go is the next iteration of a process that has been both time-consuming and expensive for the company. While other companies such as Nokia and Apple innovated with clever interfaces, Microsoft has stuck with voice and their vision around it. With each new application released using their library, and the embarrassing recognition results that follow, the Microsoft bet on voice is looking less and less likely to pay off.