Amidst narratives of machine learning complacency, Apple is coming to terms with the fact that not talking about innovation means innovation never happened.
A detailed blog post in the company’s Machine Learning Journal makes public the technical effort that went into its “Hey Siri” feature, a capability so banal that I’d almost believe Apple was trying to make a point with highbrow mockery.
Even so, it’s worth taking the opportunity to explore exactly how much effort goes into the features that do, for one reason or another, go unnoticed. Here are five things that make the “Hey Siri” functionality (and competing offerings from other companies) harder to implement than you’d imagine, and commentary on how Apple managed to overcome the obstacles.
It couldn’t drain your battery and processor all day
At its core, the “Hey Siri” functionality is really just a detector. The detector is listening for the phrase, ideally using fewer resources than the entirety of server-based Siri. Still, it wouldn’t make sense for even this detector to tie up a device’s main processor all day.
Fortunately, the iPhone has a smaller “Always On Processor” that can be used to run detectors. At this point in time, it wouldn’t be feasible to smash an entire deep neural network (DNN) onto such a small processor. So instead, Apple runs a tiny version of its DNN for recognizing “Hey Siri.”
When that model is confident it has heard something resembling the phrase, it calls in backup and has the captured signal analyzed by a full-size neural network. All of this happens in a split second, such that you wouldn’t even notice it.
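To make the hand-off concrete, here is a minimal Python sketch of a two-stage cascade. It is not Apple’s code: the function name, the two model callables and both threshold values are assumptions made up for illustration, and each model is simply treated as something that takes audio samples and returns a confidence between 0 and 1.

```python
from typing import Callable, Sequence

# A model here is any callable that maps audio samples to a confidence in [0, 1].
Model = Callable[[Sequence[float]], float]


def detect_wake_phrase(
    audio: Sequence[float],
    small_model: Model,
    large_model: Model,
    first_threshold: float = 0.5,   # hypothetical value, not from the article
    second_threshold: float = 0.8,  # hypothetical value, not from the article
) -> bool:
    """Return True only when both stages are confident about the phrase."""
    # Stage 1: the tiny, always-on model cheaply rejects most audio.
    if small_model(audio) < first_threshold:
        return False
    # Stage 2: the full-size model runs only on audio that passed stage 1.
    return large_model(audio) >= second_threshold
```

The point of the design is that the expensive model never runs unless the cheap one has already raised its hand, which is what keeps the main processor idle most of the day.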
All languages and ways of pronouncing “Hey Siri” had to be accommodated
Deep learning models are hungry and suffer from what’s called the cold start problem: the period when a model just hasn’t been trained on enough edge cases to be effective. To overcome this, Apple got crafty and pulled audio of users saying “Hey Siri” naturally and without prompting, before the Siri wake feature even existed. Yeah, I’m with you, it’s weird that people would attempt to have real conversations with Siri, but crafty nonetheless.
These utterances were transcribed, spot checked by Apple employees and combined with general speech data. The aim was to create a model robust enough that it could handle the wide range of ways in which people say “Hey Siri” around the world.
Apple had to address the pause people would place between “Hey” and “Siri” to ensure that the model would still recognize the phrase. At this point, it became necessary to bring other languages into the mix, adding in examples to accommodate everything from French’s “Dis Siri” to Korean’s “Siri 야.”
It couldn’t get triggered by “Hey Seriously” and other similar but irrelevant terms
It’s obnoxious when you are using an Apple device and Siri activates without intentional prompting, pausing everything else — including music. The horror! To fix this, Apple had to get intimate with the voices of individual users.
When users set up “Hey Siri,” they say five phrases that each begin with “Hey Siri.” These examples get stored and mapped into a vector space by another specialized neural network. This space allows for the comparison of phrases said by different speakers. All of the phrases said by the same user tend to be clustered, and this can be used to minimize the likelihood that one person saying “Hey Siri” in your office will trigger everyone’s iPhone.
In the worst-case scenario, where the phrase passes muster locally but still really isn’t “Hey Siri,” it gets one last vetting from the main speech model on Apple’s own servers. If the phrase is found to not be “Hey Siri,” everything immediately gets canceled.
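The clustering idea can be sketched in a few lines of Python. This assumes the speaker network has already turned each utterance into a vector. The use of cosine similarity, the averaging over enrolled phrases, the 0.7 threshold and the function names are all assumptions for illustration, since the article doesn’t say how Apple scores the comparison.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(
    candidate: Sequence[float],
    enrolled: Sequence[Sequence[float]],
    threshold: float = 0.7,  # hypothetical value, not from the article
) -> bool:
    """Accept the candidate if it sits close to the enrolled utterances on average."""
    if not enrolled:
        raise ValueError("at least one enrolled utterance is required")
    scores = [cosine_similarity(candidate, vec) for vec in enrolled]
    return sum(scores) / len(scores) >= threshold
```

A vector from the phone’s owner should land near their five enrolled phrases and pass, while a coworker’s vector should land elsewhere in the space and fail.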
Activating Siri had to be just as easy on the Apple Watch as the iPhone
The iPhone might seem limited in horsepower when compared to Apple’s internal servers, but the iPhone is a behemoth when compared to the Apple Watch. The watch runs a distinct model for detection that isn’t as large as the full neural network running on the iPhone or as small as the initial detector.
Instead of always running, this mid-sized model only listens for the “Hey Siri” phrase when a user raises their wrist to turn the screen on. Because of this and the ensuing potential delay in getting everything up and running, the model on the Apple Watch is specifically designed to accommodate variations of the target phrase that are missing the initial “H” sound.
It had to work in noisy rooms
When evaluating its detector, Apple uses recordings of people saying “Hey Siri” in a variety of situations — in a kitchen, car, bedroom, noisy restaurant, up close and far away. The data collected is then used for benchmarking accuracy and further tuning the thresholds that activate models.
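For a rough sense of what tuning a threshold against benchmark recordings involves, here is a small Python sketch. It assumes each recording has already been scored by the detector, with one list of scores for genuine “Hey Siri” clips and one for everything else. The function names and the selection rule (lowest false-reject rate within a false-accept budget) are assumptions for illustration, not Apple’s stated procedure.

```python
from typing import Optional, Sequence, Tuple


def error_rates(
    positive_scores: Sequence[float],
    negative_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false_reject_rate, false_accept_rate) at the given threshold."""
    # A genuine phrase scoring below the threshold is wrongly rejected.
    false_rejects = sum(1 for s in positive_scores if s < threshold)
    # Any other audio scoring at or above the threshold is wrongly accepted.
    false_accepts = sum(1 for s in negative_scores if s >= threshold)
    return (
        false_rejects / len(positive_scores),
        false_accepts / len(negative_scores),
    )


def pick_threshold(
    positive_scores: Sequence[float],
    negative_scores: Sequence[float],
    candidates: Sequence[float],
    max_false_accept_rate: float,
) -> Optional[float]:
    """Pick the candidate with the fewest false rejects within the accept budget."""
    best: Optional[Tuple[float, float]] = None  # (threshold, false_reject_rate)
    for t in candidates:
        frr, far = error_rates(positive_scores, negative_scores, t)
        if far <= max_false_accept_rate and (best is None or frr < best[1]):
            best = (t, frr)
    return None if best is None else best[0]
```

The trade-off is the familiar one: raise the threshold and Siri stops barging in, but it also starts ignoring you from across a noisy kitchen.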
Unfortunately, my iPhone still doesn’t understand context and Siri was triggered so many times while I was proofreading this piece aloud that I tossed my phone across the room.