Yahoo Releases Its Biggest-Ever Machine Learning Dataset To The Research Community

Yahoo announced this morning that it’s making the largest-ever machine learning dataset available to the academic research community through its ongoing program, Yahoo Labs Webscope. The new dataset measures a whopping 13.5 TB (uncompressed) in size, and consists of anonymized user interaction data. Specifically, it contains interactions from about 20 million users from February 2015 through May 2015, including those that took place on the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, and Yahoo Real Estate.

In addition to the user interaction data, the dataset includes demographic information such as age range, gender, and generalized geographic data. It also records the title, summary, and key phrases of each news article, plus local timestamps and partial device information.

Explains Suju Rajan, Director of Personalization Science at Yahoo Labs, “Data is the lifeblood of research in machine learning. However, access to truly large-scale datasets is a privilege that has been traditionally reserved for machine learning researchers and data scientists working at large companies – and out of reach for most academic researchers.”

As you may imagine, the inability to test against “real-world” data can hamper innovation and, in turn, slow down progress.


Researchers at Carnegie Mellon University, the University of California, San Diego, and the UMass Amherst Center for Data Science have already stated that they’ll be using the newly released dataset in their own studies. For example, at CMU, researchers will be able to study how to automatically discover which news articles are of interest to which users, noted Tom Mitchell, chair of the university’s machine learning department.

Rajan also says that those at Yahoo Labs have used sizable datasets like this to work on large-scale machine learning problems that are inspired by consumer-facing products, in particular in areas like search ranking, computational advertising, information retrieval, and core machine learning.

However, Yahoo wanted to “level the playing field between industrial and academic research,” which is why it’s releasing the new dataset to the wider community, she adds.

While it’s certainly welcome news to see a large contribution from Yahoo to the machine learning community, it’s not an entirely altruistic move. Yahoo’s larger goal here is advancing the study of machine learning – something that grew out of AI research and now focuses on developing algorithms that can learn and make predictions by using data. But if it succeeds by enabling researchers to accelerate the pace of innovation, Yahoo, too, will benefit by being able to take those learnings and apply them to its own products.

Yahoo, of course, is not the only major tech company making large-scale contributions like this. In November, Google open-sourced the machine learning technology TensorFlow, which powers Google Photos search, Gmail’s “Smart Reply,” speech recognition in the Google app, and more. IBM Watson, Amazon Machine Learning, and Azure Machine Learning are other notable names in the space.

Yahoo’s Webscope program is not new, and already offers a number of datasets composed of anonymized user data for non-commercial use. However, this 13.5 TB machine learning data dump is its largest to date. Other datasets available on its site are measured in GBs, not TBs, such as a dataset of more than 50 GB that contains a sample of pages with HTML forms.

“Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data,” said Gert Lanckriet, professor, Department of Electrical and Computer Engineering, University of California, San Diego, in a statement. “At the Jacobs School of Engineering at UC San Diego, it will directly and significantly benefit the wide variety of ongoing research in machine learning, artificial intelligence, information retrieval, and big data applications.”