Machines learn language better by using a deep understanding of words

Computer systems are getting quite good at understanding what people say, but they also have some major weak spots. Among them is the fact that they have trouble with words that have multiple or complex meanings. A new system called ELMo adds this critical context to words, producing better understanding across the board.

To illustrate the problem, think of the word “queen.” When you and I are talking and I say that word, you know from context whether I’m talking about Queen Elizabeth, or the chess piece, or the matriarch of a hive, or RuPaul’s Drag Race.

This ability of words to have multiple meanings is called polysemy. And really, it’s the rule rather than the exception. Which meaning it is can usually be reliably determined by the phrasing — “God save the queen!” versus “I saved my queen!” — and of course all this informs the topic, the structure of the sentence, whether you’re expected to respond, and so on.

Machine learning systems, however, don’t really have that level of flexibility. The way they tend to represent words is much simpler: they look at all the different senses of a word and come up with a sort of average. That representation is complex, to be sure, but it doesn’t reflect the word’s true complexity. When it’s critical that the correct meaning of a word gets through, these systems can’t be relied on.
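To make the "sort of average" idea concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the three-number vectors, the axis labels and the function names are assumptions, not anything taken from a real system. It only shows that a fixed lookup hands back the same blended vector for "queen" no matter which sentence the word sits in.

```python
import re

# Hypothetical toy "sense" vectors for illustration only.
# Axes are (royalty, chess, insects); real embeddings have hundreds of
# dimensions and are learned from text, not written by hand.
SENSES = {
    "queen": [
        [1.0, 0.0, 0.0],  # the monarch
        [0.0, 1.0, 0.0],  # the chess piece
        [0.0, 0.0, 1.0],  # the head of a hive
    ],
}

def static_vector(word):
    """Blend every sense of a word into one fixed vector (a plain average)."""
    senses = SENSES[word]
    return [sum(axis) / len(senses) for axis in zip(*senses)]

def embed(sentence):
    """Look up each known word; the rest of the sentence is never consulted."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return {tok: static_vector(tok) for tok in tokens if tok in SENSES}

royal = embed("God save the queen!")
chess = embed("I saved my queen!")
print(royal["queen"] == chess["queen"])  # the two sentences are indistinguishable
```

Because the lookup ignores the surrounding words, the royal sentence and the chess sentence end up with identical representations of "queen."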

ELMo (“Embeddings from Language Models”), however, lets the system handle polysemy with ease; as evidence of its utility, it was awarded best paper honors at NAACL last week. At its heart it uses its training data (a huge collection of text) to determine whether a word has multiple meanings and how those different meanings are signaled in language.

For instance, you could probably tell that my two example “queen” sentences above, despite being very similar, were about different things: one about royalty and the other about a game. That’s because the way they are written contains clues that your own context-detection engine uses to tell which queen is which.

Informing a system of these differences can be done by manually annotating the text corpus from which it learns — but who wants to go through millions of words making a note on which queen is which?

“We were looking for a method that would significantly reduce the need for human annotation,” explained Matthew Peters, lead author of the paper. “The goal was to learn as much as we can from unlabeled data.”

In addition, he said, traditional language learning systems “compress all that meaning for a single word into a single vector. So we started by questioning the basic assumption: let’s not learn a single vector, let’s have an infinite number of vectors. Because the meaning is highly dependent on the context.”

ELMo learns this information by ingesting the full sentence in which the word appears; it would learn that when a king is mentioned alongside a queen, it’s likely royalty or a game, but never a beehive. When it sees pawn, it knows that it’s chess; jack implies cards; and so on.
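The contrast with a single fixed vector can be sketched with a toy example. To be clear, this is not how ELMo works internally: ELMo learns its context sensitivity from large amounts of unlabeled text, while the cue table below is hand-written and entirely hypothetical, as are the function name and the three axis labels. The sketch only illustrates the end result, a vector for "queen" that changes with the words around it.

```python
import re

SENSES = ("royalty", "chess", "insects")  # axes of the toy vector

# Hypothetical cue words and the senses of "queen" they point toward.
# A real system learns such associations from data rather than a table.
CUES = {
    "king": ("royalty", "chess"),
    "god": ("royalty",),
    "pawn": ("chess",),
    "hive": ("insects",),
}

def contextual_vector(word, sentence):
    """Return a vector for `word` weighted by cue words elsewhere in the sentence."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    counts = {sense: 0 for sense in SENSES}
    for tok in tokens:
        if tok == word:
            continue  # the word itself is not evidence about its own sense
        for sense in CUES.get(tok, ()):
            counts[sense] += 1
    total = sum(counts.values())
    if total == 0:
        # No cues found: fall back to an even blend of all senses.
        return [1.0 / len(SENSES)] * len(SENSES)
    return [counts[sense] / total for sense in SENSES]

print(contextual_vector("queen", "God save the queen!"))
print(contextual_vector("queen", "I traded a pawn for his queen"))
print(contextual_vector("queen", "The king and queen arrived"))
```

In this toy, "king" leaves the vector split between royalty and chess with nothing on the insect axis, while "pawn" pushes it entirely toward chess. A sentence with no recognized cue falls back to the even blend, which is exactly the fixed-vector behavior that context is supposed to improve on.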

An ELMo-equipped language engine won’t be nearly as good as a human with years of experience parsing language, but even working knowledge of polysemy is hugely helpful in understanding a language.

Not only that, but taking the whole sentence into account when working out the meaning of a word also makes it easier to map the structure of that sentence, automatically labeling clauses and parts of speech.

Systems using the ELMo method had immediate benefits, improving on even the latest natural language algorithms by as much as 25 percent — a huge gain for this field. And because it is a better, more context-aware style of learning, but not a fundamentally different one, it can be integrated easily even into existing commercial systems.

In fact, Microsoft is reportedly already using it with Bing. After all, it’s crucial in search to determine intention, which of course requires an accurate reading of the query. ELMo is open source, too, like all the work from the Allen Institute for AI, so any company with natural language processing needs should probably check this out.

The paper lays the groundwork for using ELMo in English language systems, but because its power comes essentially from a close reading of the data it’s fed, there’s no theoretical reason it couldn’t be applied to other languages, or to other domains. In other words, if you feed it a bunch of neuroscience texts, it should be able to tell the difference between “temporal” as it relates to time and as it relates to that region of the brain.

This is just one example of how machine learning and language are rapidly developing around each other. Although the technology is already good enough for basic translation, speech to text and so on, there’s quite a lot more that computers could do via natural language interfaces, if they only knew how.