Will We Ever Bury Voice Recognition?

star trek

Over on our sister blog TechCrunch, we just reviewed a new Windows-based IM client called Say2Go. The client itself is simple; its main value-add is that you can record a message, have it transcribed automatically, and send it as a text message. Microsoft thought this use of its new speech technology was so great that it awarded the company first prize in its ISV/Partner application contest.

Microsoft, and particularly Bill Gates, has been a strong advocate of voice recognition technology. Gates said in 1997:

“In this 10-year time frame, I believe that we’ll not only be using the keyboard and the mouse to interact, but during that time we will have perfected speech recognition and speech output well enough that those will become a standard part of the interface.”

Voice recognition is a constant theme in his forward-looking technology books, speeches and interviews – but along with other Gates-backed technologies such as the wallet PC, the PDA watch and an operating system using real-world objects as an interface (Microsoft Bob), voice recognition has so far fallen flat and way below earlier expectations.

In using the new Microsoft library as part of Say2Go, the problems are immediate and obvious. First, the initial training takes 10 minutes, but moving anywhere beyond 80% accuracy requires a lot more user input, learning and training (80% is the Microsoft claim; somewhat ironically, the Gates book ‘The Road Ahead’ is used as the training text).

The real-world level of accuracy is much lower: with my accent, even speaking clearly and slowly and after 30 or more minutes of training, it couldn’t string together more than three or four words correctly. Whole sentences lost all meaning, and it seems pointless to speak at 30 words per minute when the keyboard in front of you is capable of 60, 90 or over 100 words per minute (especially in instant messaging, where a whole new English lexicon of shortened words and phrases has formed).
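
To put rough numbers on that comparison, here is a minimal sketch in Python. It assumes "accuracy" means one minus the word error rate (word-level edit distance divided by the length of the reference text), which is a common convention but not necessarily how Microsoft arrives at its 80% figure. The function names and the "effective words per minute" model (raw speed multiplied by accuracy, ignoring the time spent fixing mistakes) are my own illustrative assumptions, not anything taken from Say2Go or the Microsoft speech library.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between what was said and what was
    transcribed, divided by the number of words actually said."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference text must contain at least one word")

    # prev[j] = edits needed to turn the first i-1 reference words
    # into the first j hypothesis words (classic Levenshtein row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word was dropped
            insertion = curr[j - 1] + 1  # extra word was transcribed
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def effective_wpm(raw_wpm: float, accuracy: float) -> float:
    """Crude assumption: only correctly recognised words count,
    and correction time is ignored entirely."""
    return raw_wpm * accuracy


if __name__ == "__main__":
    said = "the quick brown fox jumps"
    heard = "the quick brown box jumps"
    accuracy = 1.0 - word_error_rate(said, heard)
    print(f"accuracy: {accuracy:.0%}")
    print(f"dictation at 30 wpm: {effective_wpm(30, accuracy):.0f} correct wpm")
```

Even on this generous model, dictation at 30 words per minute and the claimed 80% accuracy yields about 24 correct words per minute, well short of a 60 words per minute typist, and that is before counting the time it takes to go back and fix the misrecognised words.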

In a real-world enterprise environment, it is impossible to imagine a room full of people all using voice dictation at their computers. Background noise is difficult to filter out, and the modern office is full of interruptions: phones ringing, instant messages, new emails and more. When typing at a keyboard, you can multi-task, stopping and starting as you switch between programs. With voice recognition, you need to pause or stop recording, and you have to press a button to tell the application when you are actually speaking to it.

Portable devices previously did not have the luxury of a full QWERTY keyboard, but recent interface advances such as multi-touch, the virtual keyboard on the iPhone and predictive input technology from Nokia have brought input accuracy and speed up to almost-QWERTY levels. These technologies make voice recognition look like an unnecessary, once-futuristic technology born out of sci-fi movies, and that is probably where voice recognition should stay.

It is 2008, eleven years after the Bill Gates “within a decade” quote, and Microsoft is still pushing voice. Say2Go is the next iteration of a process that has been both time-consuming and expensive for the company. While other companies such as Nokia and Apple innovated with clever interfaces, Microsoft has stuck with voice and its vision around it. With each new application release using its library, and the ensuing embarrassing recognition results, the Microsoft bet on voice is looking less and less likely to pay off.

  • http://www.youbundle.com Neyma Jahansooz

    Like the photo says – the Star Trek era will be upon us, where everything in our life is wired to the great UltraNET. Then voice recognition will be used so you can talk to your microwave.

  • http://www.techcrunch.com michael arrington

    here’s what I want – to sit in my car and tell my nav system where I want to go, and then the car drives me there. can i have that?

  • Venky

    You can have it and it’s called a CAB !! Go hire one…

  • http://www.techcrunch.com michael arrington

    Venky, good point.

  • http://www.twitter.com/youbundle Neyma Jahansooz

    like you can be sitting in your office and you speak – ‘2 double decker tacos and a chalupa’ then your car goes and drives itself to the Taco Bell and the drive-through deducts credits from your account. Then the car returns and your kitchen bot brings the tacos to your desk.

    But really – Voice in computers needs to be seamless – I need to be able to say in common speech what I am looking for and the computer recognizes with impunity and executes. This will actually be a major corner in the search market and probably why MSFT bought Powerset seeing their history with Voice and the realistic applications of NLP. (shh don’t tell anybody)

  • http://www.texttechnologies.com/2008/07/07/techcrunchit-rants-against-voice-recognition/ TechCrunchIT rants against voice recognition | Text Technologies

    […] ranted yesterday against voice recognition. Parts of the argument have validity, but I think the overall argument was […]

  • http://crueltobekind.org Nicole Simon

    You are missing the point.

    There is no reason to bury it, but all the reasons in the world to finally nail it. I remember a presentation at IBM in ’94 where a guy was handed a newspaper article of our choice; he read it and the system flawlessly transcribed it to German, grammatical changes to words included.

    Dragon for German has an okay recognition far beyond 80% (but still annoying enough to not make me use it) while English is way harder since I have an all over the place accent in American, even though I have a voice for recognition (steady, quite clear etc).

    Yes, even as I am sitting typing while there are no colleagues around me in my office, I have noise in the background. Yes people will have problems with colleagues around them. But guess what: That does not hinder them for example to phone with other people.

    You are also wrong on the button – one app for the Windows native speech recognition did react to a phrase, like in Star Trek (which I found very clever), and for the moment I do like that I can switch it on and off easily with a button. After all, I hit buttons nearly blind on my keyboard as part of my training.

    Overall I find your conclusion “did not work in the past so let’s bury it” to be lacking not only of imagination but also of ideas. People still do speak much faster than they type – even if speaking slower – and ideas are much more free flowing when speaking than typing.

    And mobile devices are better off with keyboard input than voice – are you kidding me? There is a reason why mobile phone usage for example in cars is restricted to handless usage.

    Next thing you tell us is that computers are overrated and we should be going for typewriters …

  • http://www.techcrunchit.com/ Nik Cubrilovic

    Nicole: some valid points, but I find that I write a lot more clearly than I speak. You definitely do structure sentences differently when writing/speaking.

  • sam

    i agree that the technology is not well developed. but for people with injuries that prevent them from continuing to make a living working at the computer (i.e. me), the technology, however poor, is still a godsend compared to the alternative (change career altogether or endure excruciating pain). so i’m definitely against abandoning trying altogether, i just wish they would do a better job at it and figure out a way to account for different accents (which was one of my big problems when i started using voice recognition).

    btw, i use dragon naturally speaking which i find reasonably accurate.

  • http://blogs.msdn.com/robch Rob Chambers

    Nik, it sounds like the technology you used to evaluate Microsoft’s speech recognition, is almost as old as this quote from Bill Gates. The speech technology included in windows XP is almost eight years old.

    You should try upgrading to the latest Windows Vista. Windows Vista includes the very latest in speech technology, Windows Speech Recognition.

    By the way, I used speech recognition to dictate this entire post.

    Including my signature.

    Rob Chambers [MSFT]
    Windows Speech Recognition – We’re Listening…

    This posting is provided “AS IS” with no warranties, and confers no rights.

  • http://blogs.msdn.com/robch/archive/2008/07/07/say2go-awarded-1st-prize-in-isv-software-solutions-in-microsoft-s-2008-partner-contest.aspx Rob's Rhapsody : Say2Go awarded 1st Prize in ISV/Software Solutions in Microsoft's 2008 Partner Contest

    […] the good folks over at TechCrunch did a review of Say2Go (here) to which Nik at TechCrunchIT responded with his own discussion of "Will we ever bury Voice Recognition?" In that discussion, […]

  • http://www.hxa.name/ Harrison Ainsworth

    There are niches for speech but that is probably all.

    On its own, it is a very ‘low-dimensional’ interface — even worse than a command-line (which also has its niches but is not popular): you cannot see what you have just said, or what options are available (imagine issuing a non-simple command, thinking you have made a mistake, and wanting to re-edit it…).

    Unless speech-recognition is backed by powerful artificial intelligence it will be confined to that weakness, and such intelligence seems unavailable (and even then the basic ‘low-dimensionality’ remains.).


    on a global scale aren’t humans cheaper/more accurate at dictation? SR will only be usable once the equation dips to the other side

  • http://www.techcrunchit.com/ Nik Cubrilovic

    @Rob: oh c’mon, I just finished the Vista to XP upgrade, not going back just yet :)

    More seriously though, I am pretty sure that I got the latest SDK from your site and installed it – it’s the download that the Say2Go install defers the user to. If I was running an older version, it could only be because the install I did was either wrong or it was still calling an old version.

    I shot you an email using your contact form – I would be interested in learning more about the Speech SDK and technologies at MSFT. I understand that it can be tricky to pick up Australian, or that it might need a bit more training, but I am open to working on it.

  • Henry Work

    Nice article Nik

  • Norbert Perkins

    Don’t write off voice recognition just yet – there are plenty of small players out there doing very well with voice recognition. My personal favourite is Jott. It does a great job of transcribing my message – but a) it expects me to talk like a normal person (Not the way that Vista does it=PC operations into Text……File…Open??) and b) that output becomes actionable. Example: I call. Jott says “Who do you want to Jott?” I say “Google Calendar” – then I say “Meet Susan for lunch tomorrow” – Result: I get an appointment in my Google calendar at 12 the next day – with Susan. Nice and simple. No need to boot up my PC or defy physics on my iPhone (big fingers, minuscule soft keys…) …..

  • http://www.commonsensepr.com Eric Eggertson

    Are there speech recognition programs out there that can do ANYTHING with 30 minutes of training? If so, that’s awesome.

    My spouse uses Dragon, and finds it very helpful. But when she loads it on a new machine, the IT people say she has to retrain from scratch. That can’t be right…

  • http://voicesage.blogspot.com Paul Sweeney

    The problem (IMHO) is that voice had to be hosted in the cloud to get the volume of interactions and sharing across individual users, and use contexts, to get the accuracy up. That’s the secret sauce in TellMe. Jott is indeed very cool, as is http://www.dial2do.com which is integrated with http://www.jajah.com so that you can call all your contacts, by name, from your car, on a low cost account. It’s a tight mashup. People have been nothing but impressed by the quality of voice rec on the http://www.twitterfone.com service, itself a mashup of dial2do, maxroam, and zong. So, plenty going on out there.

  • http://www.itbusinessedge.com/blogs/cip/?p=382 Voice Recognition Strives for Success in the Hear and Now - Caller IP

    […] TechCrunch IT throws some cold water on the feel-good story of voice recognition, however. The writer does this through a close look at Say2Go, a new Windows-based IM client. Bill Gates, he says, predicted 11 years ago that voice recognition would be a common user interface technology by now. It hasn’t happened, and the writer points to Say2Go as an example of why the industry segment is struggling. The “training”– the reading of text into the device to give it a reference point – is time-consuming. Microsoft aims at 80 percent accuracy, but the writer says that that isn’t realistic. Say2Go, he said, made numerous mistakes and lost the meaning of whole sentences. […]

  • http://speech.even-zohar.com Itamar Even-Zohar

    I am sorry to say you are not really updated in SR. If you don’t like dictating that’s your legitimate preference of course, but you cannot deduce from your attitude that using SR for dictation is not practical. Other people think it is more practical, more convenient and more pleasurable than typing (while the disabled have got no choice anyway). Those who do prefer dictation (and controlling the computer partly or wholly by voice) now have at least two great applications available: WSR and DNS 9.5. With the brand-new macro feature in WSR, recently added, it can now also be expanded to creating new commands and shortcuts.

    Again, whether creating texts by voice makes them more readable or not is a matter of evaluation. In my 40 years of experience, it definitely does, and this has been greatly confirmed in my personal experience since 1997, when SR had its real breakthrough.

    At any rate, before expressing learned opinions, I suggest you should at least know what’s up. I humbly recommend my own speech website (see above) for updated information and access to resources.

  • Antihadron

    Norbert, while there are numerous small players doing well at speech recognition, Jott is not one of them. Jott is an example of ‘hamsterware’: what you say is sent in audio to a real person (working in India or perhaps elsewhere). That person types in what you said and sends it back to you. While this is interesting, it’s not machine-based speech recognition in my book.

  • http://www.jabbertags.com/popular/voicerecognition Recent Links Tagged With "voicerecognition" - JabberTags

    […] public links >> voicerecognition Will We Ever Bury Voice Recognition? Saved by Hopkins90 on Sun 02-11-2008 Speech Mashups: Ends the need for voice recognition software […]

  • http://www.1to1french.com french tutoring

    I like HP software for voice recognition, it is installed when you buy a new laptop, i use it….works great.

  • http://www.appspatrol.com iPhone Apps Review

    I think that voice recognition is great for people who have arm/hand trouble – from carpal tunnel all the way to total disability. I generally think it’s a good idea to innovate, and we’ll find a good use for it. Also, I do have this fantasy of treating my computer like a secretary. “Miss Macintosh, take a letter…”
