AI2’s Semantic Scholar expands to cover 175 million papers in all scientific disciplines

There are a lot of scientific papers out there, and finding the right ones, or the right connections between them, can be extremely difficult. Semantic Scholar uses AI to understand and index journal articles, but until recently has been limited to a handful of topics. It has now expanded to cover practically every branch of science — and some 175 million papers.

I covered Semantic Scholar, a project of the Allen Institute for AI, when it first launched in 2016, at which time it had only indexed papers in computer science and neuroscience. The next year, it added biomedical papers covering a variety of sub-topics.

The problem they are attempting to solve is simply that there’s too much information for academics to parse. And while they may do their best to keep up with the literature, a key insight or relevant result may be hidden away in an obscure journal that only gets the vaguest reference in a citation or review.

“We created it because of information overload in science,” explained project head Doug Raymond in an interview. “The focus of the team was, how do we make science more discoverable?”

Semantic Scholar uses natural language processing to get the gist of a paper, understand what processes, chemicals, or results are described, and make that information easily searchable. Not only does it make finding literature relevant to a given topic easier, but it can establish patterns and find connections that might not have been clear before.

For instance, the platform may make it possible to identify trends in authorship, such as gender and other demographic balance (work on this is under way), or to find bad actors who systematically cite themselves. In other cases the trends may be more immediately relevant: the majority of patients with kidney diseases are female, but the majority of those used in studies are male.
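To make the self-citation case concrete, here is a minimal sketch of how such a pattern could surface once papers, authors and citations are indexed together. Everything in it is an assumption for illustration: the toy data, the author names and the `self_citation_rate` helper are hypothetical and say nothing about how Semantic Scholar actually does it.

```python
from collections import defaultdict

# Hypothetical toy data: paper id -> set of author names.
AUTHORS = {
    "p1": {"A. Rivera"},
    "p2": {"A. Rivera", "B. Chen"},
    "p3": {"C. Osei"},
    "p4": {"A. Rivera"},
}

# Hypothetical citation edges: (citing paper, cited paper).
CITATIONS = [
    ("p2", "p1"),
    ("p4", "p1"),
    ("p4", "p2"),
    ("p4", "p3"),
    ("p3", "p1"),
]


def self_citation_rate(authors, citations):
    """Per author, the share of their outgoing citations that point at their own papers."""
    total = defaultdict(int)  # citations made by papers the author wrote
    own = defaultdict(int)    # of those, citations to papers the author also wrote
    for citing, cited in citations:
        for author in authors[citing]:
            total[author] += 1
            if author in authors[cited]:
                own[author] += 1
    return {author: own[author] / total[author] for author in total}


rates = self_citation_rate(AUTHORS, CITATIONS)
for author, rate in sorted(rates.items()):
    print(f"{author}: {rate:.0%} self-citations")
```

In this toy set the invented author "A. Rivera" stands out at 75%, while the others sit at zero. A real analysis would also need to disambiguate author names and account for fields where citing your own prior work is normal.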

That’s not to say the system is doing research by itself, but facts and trends can appear under this kind of analysis that might have remained dormant otherwise, especially since the system now encompasses most scientific domains and can make connections between them as well as within them.

Expanding from a handful of disciplines to practically all of them was not an easy process, though the challenges are not what you might guess.

“We found that most of our models generalize well to new domains of science,” said Raymond. “That said, there’s always room for improvement. Some domains have very different conventions in how they write abstracts or lay out tables.”

The language model they created, SciBERT (an evolution of BERT, a more general-purpose NLP model), has been tweaked to understand different types of notation and so on. But apparently it didn’t choke, as I would have, after learning on CS and moving to organic chemistry. The results are functional enough to package into something like

Raymond said the biggest problem was the more prosaic challenge of improving the system’s infrastructure to support the increased volume of data.

“The hardest thing, I’d say, was moving to a data pipeline that’s real-time and instantaneous rather than batch processing them,” Raymond explained. “Once we got to this scale, with the number of papers and partners, we had to redo the pipeline to get things done in hours rather than days.”
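Raymond didn't go into the pipeline's internals, so the following is only a toy contrast between the two styles he mentions, not a description of the real system. It assumes a simple word-to-paper search index. The function and class names (`build_index_batch`, `IncrementalIndex`) and the sample titles are invented for the example.

```python
from collections import defaultdict


def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_index_batch(papers):
    """Batch style: rebuild the whole index from the full list of papers."""
    index = defaultdict(set)
    for paper_id, title in papers:
        for word in tokenize(title):
            index[word].add(paper_id)
    return index


class IncrementalIndex:
    """Incremental style: update the index as each paper arrives."""

    def __init__(self):
        self.index = defaultdict(set)

    def add(self, paper_id, title):
        # Only the words of the new paper are touched; nothing is rebuilt.
        for word in tokenize(title):
            self.index[word].add(paper_id)


# Hypothetical sample papers: (id, title).
papers = [
    ("p1", "Kidney disease outcomes"),
    ("p2", "Graph neural networks"),
    ("p3", "Neural models of kidney function"),
]

batch = build_index_batch(papers)

live = IncrementalIndex()
for paper_id, title in papers:
    live.add(paper_id, title)

print(sorted(live.index["kidney"]))
```

Both approaches end up with the same index. The difference is cost: the batch version has to reprocess every paper to pick up a new one, while the incremental version does work proportional to the new paper alone, which matters a great deal at the scale of 175 million papers.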

More partners means working with major science publishing outfits like Elsevier and Nature, which, facing the threat of SciHub and pressure from academics to move toward open access models, feel both stick and carrot when it comes to working with new efforts like Semantic Scholar.

As it is, the system has ingested most of the open access literature out there, and also has the key information for papers behind paywalls — users just won’t be able to pull up the full document without paying. On the other side of the equation, a partnership with Unpaywall keeps links to open access papers up to date. Open access articles, the platform has happened to note, are a rapidly increasing proportion of all articles: more than doubling, from something over 10% to just under 30% in the last decade.

Now that the expansion part is mostly complete, the Semantic Scholar team is working on a few new features: improved summaries of articles, domain-specific functions and a feed view that could show, say, a cell biologist the latest and most relevant findings in their field without exposing them to the firehose of research constantly being published.

Semantic Scholar is free to use; you can find it at semanticscholar.org.