Last year, voice technology giant Nuance quietly acquired VirtuOz, a developer of virtual assistants for online sales, marketing and support — a “Siri for the enterprise” that counted the likes of PayPal and AT&T as customers. Now, Alexandre Lebrun, the founder and CEO of VirtuOz, has taken a dive back into the startup world to launch Wit.ai, a platform and API that will let a developer incorporate speech recognition and a natural language interface into any app or piece of hardware.
In Lebrun’s words, the idea here is to apply a “Twilio or Stripe model” to the world of voice interfaces, with Wit able to understand the intent of users as well as recognise their speech.
Developers who want to incorporate this into their apps enter a few lines of Wit.ai code; for the first time, they do not have to be experts in the field themselves, or face the huge expense of bringing in that technical knowledge from elsewhere.
In keeping with its startup heroes Twilio and Stripe, Lebrun and his co-founder Willy Blandin applied to and are now part of the current Y Combinator class. Outside of that, Wit.ai has been making some impressive progress, too.
Within two months of opening up a beta of its service and just on word of mouth, it has picked up registrations from 2,000 developers; and it has several big customers already signed including a major device maker, a car company, a TV channel and several home automation startups. (I’ve agreed not to mention any of them by name for now but I promise you it’s a commendable list.) Lebrun and his small team currently work out of smartwatch maker Pebble’s offices in Palo Alto.
As Lebrun describes it to me, the impetus for launching Wit.ai as a new startup came from two places.
The first is that an API-based natural language platform is still a very new concept. That’s because platforms involving natural voice have been very tricky to create as anything other than bespoke applications tailored to particular use cases. In effect, voice interfaces have been too costly and too time-consuming for most app publishers to build.
The developer demand and the challenge of meeting it were both things that Lebrun encountered regularly when he was still running VirtuOz.
“When someone would ask us to build an intelligent agent these typically cost $100,000 to build and three months of work,” he said. “It’s very technical. You have to consider grammar, specific terms, different use cases.”
And yet, for the new generation of devices, natural language is going to be the most natural way to interact.
“Steve Jobs put Siri on the iPhone, but is it absolutely necessary? Most people don’t like it because it doesn’t really do anything that you can’t already do with other functions on the phone,” he said.
You could also argue that this is partly why even Apple hasn’t pushed to make Siri a far stronger feature than it is today. It almost feels as if having it there, with people getting used to the idea of it, is enough.
In contrast, Lebrun says that Wit.ai focuses on how developers will be able to create for the next generation of devices, “where you don’t have that keyboard. Think about Nest or Google Glass or many other wearables. There is no alternative to voice.”
The second reason for the creation of Wit.ai has to do with wider questions of innovation. Lebrun was already in a major voice technology company, so why not just launch Wit.ai as a project there?
He tells me that, apart from the fact that larger companies can be very resistant to technologies that potentially cannibalise their existing revenues, he feels that large, older organisations can often simply be very challenging places for incubating and growing new concepts.
“If you asked why Nuance could not develop this, I would respond with, why didn’t AT&T create Twilio? It just has to be an outsider,” Lebrun explained.
And to be sure, Wit.ai is not alone here: there are other natural language and speech recognition specialists that you can imagine may also be looking at how they can create the equivalent of an API for Siri-like technology. Potential candidates include Robin Labs, Amazon, Nuance and Intel.
What Wit.ai is doing on a technical level taps very much into the benefits and advances brought by big data architecture.
Wit.ai has incorporated several language processing engines that run in parallel, including the open source CMU Sphinx project developed at Carnegie Mellon. Using machine learning, Wit.ai can intelligently combine the results and cover general language as well as specialised vocabularies. Lebrun describes Wit.ai as a “virtual layer on top of them” that weaves the different services into a single cloth. In that sense it differs from another company, Ask Ziggy, that has also created an API to put natural language understanding into apps, but doesn’t provide the processing engine integration.
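To make the “virtual layer” idea concrete, here is a rough sketch of how results from several engines running in parallel could be merged. It is not Wit.ai’s code: the three engine functions are hypothetical stand-ins that return canned transcripts, and a simple confidence-weighted vote takes the place of the machine learning the company describes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real speech engines. Each one takes audio
# bytes and returns a (transcript, confidence) pair; the values are canned.
def general_engine(audio):
    return ("turn on the lights", 0.6)

def home_vocab_engine(audio):
    return ("turn on the lights", 0.7)

def noisy_engine(audio):
    return ("turn on the lice", 0.8)

ENGINES = [general_engine, home_vocab_engine, noisy_engine]

def combine(audio, engines, weights):
    """Run every engine in parallel and pick the transcript with the
    highest total of weight * confidence (a simple weighted vote)."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda engine: engine(audio), engines))

    scores = defaultdict(float)
    for (transcript, confidence), weight in zip(results, weights):
        scores[transcript] += weight * confidence
    return max(scores, key=scores.get)

best = combine(b"", ENGINES, [1.0, 1.0, 1.0])
print(best)
```

In this toy setup, two engines that agree outvote a single engine that is more confident but alone. A production system would presumably learn the weights per engine and per vocabulary rather than fix them by hand.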
Then, there is an element of crowdsourced data thrown into the mix. Wit.ai draws on a database of commands that is effectively an aggregation of all the phrases that developers using its API have entered for their particular apps. Developers have the option to contribute these, although it is not necessary.
“While natural language is hard, there is also a lot of overlap in what people want,” he says. “When each developer gives some examples, Wit understands and connects that example to others like it, and those new phrases become part of the bigger dataset.”
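As a loose illustration of that pooling effect, the sketch below keeps example phrases from several developers in one shared store and matches a new phrase to the closest example by word overlap. Everything in it is assumed for the sake of the example: the `ExamplePool` class, the intent names and the Jaccard similarity measure are not taken from Wit.ai, whose actual matching method is not described here.

```python
def tokens(text):
    """Lowercase a phrase and split it into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Word-overlap similarity between two token sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class ExamplePool:
    """Hypothetical shared store of (phrase, intent) examples."""

    def __init__(self):
        self.examples = []  # list of (token set, intent name)

    def add(self, phrase, intent):
        self.examples.append((tokens(phrase), intent))

    def classify(self, phrase, threshold=0.3):
        """Return the intent of the most similar stored example, or
        None when nothing is similar enough."""
        query = tokens(phrase)
        best_intent, best_score = None, 0.0
        for example, intent in self.examples:
            score = jaccard(query, example)
            if score > best_score:
                best_intent, best_score = intent, score
        if best_score < threshold:
            best_intent = None
        return {"intent": best_intent, "score": best_score}

pool = ExamplePool()
# Examples contributed by two different (imaginary) developers.
pool.add("wake me up at 7am", "set_alarm")
pool.add("set an alarm for tomorrow morning", "set_alarm")
pool.add("turn on the kitchen lights", "lights_on")

result = pool.classify("turn on the lights")
print(result["intent"])
```

A third developer who never supplied a lights example would still get a structured answer for “turn on the lights”, which is the overlap Lebrun is describing.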
This has been a hit with developers.
“Wit’s uncanny ability to turn free form language into structured data has transformed TalkTo’s ability to get answers from businesses by helping us understand the meaning of each question,” said Riley Crane, co-founder and CTO of TalkTo, a startup whose free app lets users get answers from local stores without calling them. “Their API provides a new and much needed superpower for developers by forming a bridge between the unstructured world that we humans live in and the rigid exacting world of machine code.”
At its heart, Wit.ai is trying to disrupt the siloed way that natural language and speech recognition technology has been developed over the years. “This is a revolution in the voice and language industry,” says Lebrun.
Here’s a video of how the technology works in action: