A cautionary tale about humans creating biased AI models

Most artificial intelligence models are built and trained by humans, and therefore have the potential to learn, perpetuate and massively scale the human trainers’ biases. This is the word of warning put forth in two illuminating articles published earlier this year by Jack Clark at Bloomberg and Kate Crawford at The New York Times.

TL;DR: The AI field lacks diversity, even more spectacularly than most of our software industry. When an AI practitioner builds a data set on which to train his or her algorithm, it is likely that the data set will represent only one worldview: the practitioner’s. The resulting AI model demonstrates a non-diverse “intelligence” at best, and a biased or even offensive one at worst.

The articles focus on two related areas in which diversity and demographics matter when it comes to building AI: the data scientist, and the data scientist’s choices for training data. Again, the theory is that the practitioner’s selection of training data (say, images of people’s eyes or tweets in English) reflects, even if only subconsciously, the types of objects and experiences with which the practitioner is most familiar (perhaps images of one particular demographic’s eyes, or tweets written in British English).

There’s a third area in which demographics and diversity matter, though. It’s just as important, and it’s often overlooked — it’s the annotators.

Many people = many (varying) viewpoints

Data used for training AI and machine learning models must be labeled — or annotated — before it can be fed into the algorithm. For instance, computer vision models need annotations describing the categories to which images belong, the objects within them, the context in which the objects appear and so on.

Natural language models need annotations that teach the models the sentiment of a tweet, for example, or that a string of words is a question about the status of an online purchase. Before a computer can know or “see” these things itself, it must be shown many confident positive and negative examples (aka ground truth or gold standard data). And you can only get that certainty from the right human annotators.

So what happens when you don’t consider carefully who is annotating the data? What happens when you don’t account for the differing preferences, tendencies and biases among varying humans? We ran a fun experiment to find out.

Gender makes a significant difference

Actually, we didn’t set out to run an experiment. We just wanted to create something fun that we thought our awesome tasking community would enjoy. The idea? Give people the chance to rate puppies’ cuteness in their spare time. While we design all of our tasks to be fun and engaging, they still require smarts and skills, and we figured it would be cool of us to throw in some just-for-smiles tasks. An adorable little brain break, if you will.

And so we set up a “Rate the Puppies” task, served users puppy pics and asked them to rate each pooch’s cuteness on a scale of 1 to 5 stars. Everyone loved it. Including us. Duh! We love dogs! (Also cats! We love cats, too. And cat people. For the record.) But when we analyzed the data, one thing immediately jumped out: On average, women gave higher cuteness ratings — a statistically significant 0.16 stars higher.


There was a clear gender gap: a very consistent pattern of women rating the puppies as cuter than the men did. The gap between women’s and men’s ratings was narrower for the “less-cute” (ouch!) dogs, and wider for the cuter ones. Fascinating.
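For readers who want to run a similar check on their own annotation data, here is a minimal sketch in Python. It is not the analysis we ran. It swaps in a generic permutation test for a difference in mean ratings between two annotator groups, and the ratings in it are made up purely for illustration. The function names and parameters (mean_gap, permutation_p_value, n_permutations, seed) are hypothetical.

```python
import random


def average(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def mean_gap(group_a, group_b):
    """Difference in mean rating: group_a minus group_b."""
    return average(group_a) - average(group_b)


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Repeatedly shuffles the pooled ratings, splits them into two groups of
    the original sizes, and counts how often the shuffled gap is at least
    as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean_gap(group_a, group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_gap = abs(average(pooled[:n_a]) - average(pooled[n_a:]))
        # Small tolerance guards against floating-point noise in the comparison.
        if shuffled_gap >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up star ratings (1 to 5), NOT data from the experiment described here.
ratings_women = [5, 4, 5, 4, 5, 3, 4, 5]
ratings_men = [4, 4, 5, 3, 4, 3, 4, 4]

gap = mean_gap(ratings_women, ratings_men)
p_value = permutation_p_value(ratings_women, ratings_men)
print(f"mean gap: {gap:.2f} stars, permutation p-value: {p_value:.3f}")
```

A permutation test is used here only because it needs no assumptions about how ratings are distributed and runs with the standard library. With real data you would also want to account for the same annotator rating many images, which this sketch ignores.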

I won’t even try to unpack the societal implications of these findings, but the lesson here is this: If you’re training an artificial intelligence model — especially one that you want to be able to perform subjective tasks — there are three areas in which you must evaluate and consider demographics and diversity:

  • yourself
  • your data
  • your annotators

This was a simple example: binary gender differences explaining one subjective numeric measure of an image. Yet it was unexpected and significant. As our industry deploys incredibly complex models that push chipsets, algorithms and scientists to their limits, we risk reinforcing subtle biases, powerfully and at a previously unimaginable scale. Even more pernicious, many AIs reinforce their own learning, so we need to carefully consider “supervised” (aka human) re-training over time.
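One hedged way to act on the “your annotators” item is to look at who labeled each item before trusting the aggregated label. The Python sketch below is an illustration under stated assumptions, not a description of our pipeline: the record layout (item, annotator group, rating), the function names and the example ratings are all hypothetical. It compares a plain average with a group-balanced average in which each annotator group counts equally, however many annotators it contributed.

```python
from collections import defaultdict


def raw_mean(annotations, item_id):
    """Plain average of all ratings for one item."""
    ratings = [r for item, _group, r in annotations if item == item_id]
    return sum(ratings) / len(ratings)


def balanced_mean(annotations, item_id):
    """Average of per-group averages, so each group gets equal weight.

    Equal weighting is an assumption made for this sketch; the right
    target mix depends on who the model is ultimately meant to serve.
    """
    by_group = defaultdict(list)
    for item, group, rating in annotations:
        if item == item_id:
            by_group[group].append(rating)
    group_means = [sum(rs) / len(rs) for rs in by_group.values()]
    return sum(group_means) / len(group_means)


# Hypothetical annotations: (item_id, annotator_group, rating).
annotations = [
    ("puppy_1", "group_a", 5),
    ("puppy_1", "group_a", 5),
    ("puppy_1", "group_a", 4),
    ("puppy_1", "group_b", 3),
]

raw = raw_mean(annotations, "puppy_1")
balanced = balanced_mean(annotations, "puppy_1")
print(f"raw mean: {raw:.2f}, group-balanced mean: {balanced:.2f}")
```

When the two numbers diverge, the label for that item is being driven by the makeup of the annotator pool rather than by a consensus across groups, which is a signal to recruit more annotators or to review the item.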

Artificial intelligence promises to change all of our lives — and it already subtly guides the way we shop, date, navigate, invest and more. But to make sure that it does so for the better, all of us practitioners need to go out of our way to be inclusive. We need to remain keenly aware of what makes us all, well… human. Especially the subtle, hidden stuff.