MLCommons debuts with public 86,000-hour speech data set for AI researchers

If you want to make a machine learning system, you need data for it, but that data isn’t always easy to come by. MLCommons aims to unite disparate companies and organizations in the creation of large public databases for AI training, so that researchers around the world can work together at higher levels, and in doing so advance the nascent field as a whole. Its first effort, the People’s Speech Dataset, is many times the size of others like it, and aims to be more diverse as well.

MLCommons is a new nonprofit related to MLPerf, which has collected input from dozens of companies and academic institutions to create industry-standard benchmarks for machine learning performance. The endeavor has met with success, but in the process the team encountered a paucity of open data sets that everyone could use.

If you want to do an apples-to-apples comparison of a Google model to an Amazon model, or for that matter a UC Berkeley model, they really all ought to be using the same testing data. In computer vision, one of the most widespread data sets is ImageNet, which is used and cited by all the most influential papers and experts. But there’s no such data set for, say, speech-to-text accuracy.

“Benchmarks get people talking about progress in a sensible, measurable way. And it turns out that if the goal is to move the industry forward, we need data sets we can use — but lots of them are difficult to use for licensing reasons, or aren’t state of the art,” said MLCommons co-founder and executive director David Kanter.

Certainly the big companies have enormous voice data sets of their own, but they’re proprietary and perhaps legally restricted from being used by others. And there are public data sets, but with only a few thousand hours their utility is limited — to be competitive today one needs much more than that.

“Building large data sets is great because we can create benchmarks, but it also moves the needle forward for everyone. We can’t rival what’s available internally but we can go a long way towards bridging that gap,” Kanter said. MLCommons is the organization the MLPerf team formed to create and wrangle the required data and connections.

The People’s Speech Dataset was assembled from a variety of sources. About 65,000 of its hours come from English audiobooks, with the text aligned to the audio. Another 15,000 hours or so were sourced from around the web, with different acoustics, speakers and styles of speech (for example, conversational instead of narrative). In addition, 1,500 hours of English audio were sourced from Wikipedia, and 5,000 hours of synthetic speech, generated from text written by GPT-2, were mixed in (“A little bit of the snake eating its own tail,” joked Kanter). Fifty-nine languages in total are represented in some way, though as you can tell it is mostly English.
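For readers who want to check the arithmetic, the composition described above can be tallied in a few lines of Python. This is only a sketch: the hour counts are the rounded figures quoted in this article, and the dictionary keys and the `summarize` helper are illustrative names, not part of any MLCommons tooling.

```python
# Approximate hours per source, using the rounded figures quoted in the article.
# The labels are illustrative, not official names from the data set.
HOURS_BY_SOURCE = {
    "audiobooks (English, text-aligned)": 65_000,
    "web audio (varied acoustics and styles)": 15_000,
    "Wikipedia (English)": 1_500,
    "synthetic speech (GPT-2 text)": 5_000,
}


def summarize(hours_by_source):
    """Return the total hours and each source's share of that total."""
    total = sum(hours_by_source.values())
    shares = {name: hours / total for name, hours in hours_by_source.items()}
    return total, shares


total_hours, shares = summarize(HOURS_BY_SOURCE)
print(f"total: {total_hours:,} hours")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The rounded figures add up to 86,500 hours, in line with the roughly 86,000 hours in the headline, and audiobooks account for about three quarters of the total.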

Although diversity is the goal (you can’t build a virtual assistant in Portuguese from English data), it’s also important to establish a baseline for what’s needed for present purposes. Is 10,000 hours sufficient to build a decent speech-to-text model? Or does having 20,000 available make development that much easier, faster or more effective? What if you want to be excellent at American English but also decent with Indian and English accents? How much of each do you need?

The general consensus with data sets is simply “the larger the better,” and the likes of Google and Apple are working with far more than a few thousand hours. Thus the 86,000 hours in this first iteration of the data set. And it is definitely the first of many, with later versions due to branch out into more languages and accents.

“Once we verify we can deliver value, we’ll just release and be honest about the state it’s in,” explained Peter Mattson, another co-founder of MLCommons and currently head of Google’s Machine Learning Metrics Group. “We also need to learn how to quantify the idea of diversity. The industry wants this; we need more data set construction expertise — there’s tremendous ROI for everybody in supporting such an organization.”

The organization is also hoping to spur sharing and innovation in the field with MLCube, a new standard for passing models back and forth that takes some of the guesswork and labor out of that process. Although machine learning is one of the tech sector’s most active areas of research and development, taking your AI model and giving it to someone else to test, run, or modify isn’t as simple as it ought to be.

The idea behind MLCube is a wrapper for models that describes and standardizes a few things, like dependencies, input and output formats, hosting and so on. AI may be fundamentally complex, but the field and the tools used to create and test it are still in their infancy.
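To make the wrapper idea concrete, here is a purely hypothetical sketch in Python of the kind of metadata such a description might carry. It is not the MLCube format or API: the `ModelWrapperSpec` class, its field names, the example values and the `missing_dependencies` helper are all assumptions invented for illustration, based only on the items listed above (dependencies, input and output format, hosting).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelWrapperSpec:
    """Hypothetical description of a packaged model (not the real MLCube schema)."""
    name: str
    dependencies: tuple   # software the model needs in order to run
    input_format: str     # what the model expects to receive
    output_format: str    # what the model produces
    hosting: str          # where or how the model is meant to run


def missing_dependencies(spec, installed):
    """List the declared dependencies that are absent from `installed`."""
    return [dep for dep in spec.dependencies if dep not in installed]


# Example values are made up for illustration.
spec = ModelWrapperSpec(
    name="example-speech-to-text",
    dependencies=("python>=3.8", "numpy"),
    input_format="audio/wav",
    output_format="text/plain",
    hosting="local-container",
)

missing = missing_dependencies(spec, installed={"numpy"})
print(missing)
```

The point of the sketch is that once these details are written down in one agreed place, whoever receives the model can check what it needs before trying to run it, rather than finding out by trial and error.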

The data set should be available now, or soon, from MLCommons’ website, under the CC-BY license, allowing for commercial use; a few reference models trained on the set will also be released.