When Data Goes Bad

So I know this guy Sulemaan from my Toronto days. Really good guy, despite being a Spurs fan. Sulemaan has a son, Syed, who is flagged as a security risk, a suspected terrorist, every time he flies. Syed is six years old. This is, of course, completely insane. But the data has been parsed; the algorithm has spoken; and so others must suffer from the idiocy of those who built the system.

Sulemaan tweeted about this on December 31st, while on his way to watch the Winter Classic in Boston (like me, he’s a Habs fan, which almost makes up for his Spurs heresy), and got an impressive amount of international media attention, which led to the discovery that dozens of families have similar stories, and to a pledge from Canada’s minister of public safety to review the “no-fly list”.

Which actually infuriates me even more than the original sin. How can a government (and the airline partners, in this case) construct an automated system that makes sweeping, important decisions about people, and not build in scope for appeals, errors, and exceptions from the word go? How can anyone be so stupid, so short-sighted, as to treat their data and their algorithms as infallible?
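
To make the complaint concrete, here is a minimal sketch, in Python, of what “scope for appeals, errors, and exceptions from the word go” might look like in a hypothetical screening service. Everything in it (the watchlist, the cleared-traveler list, the similarity threshold) is invented for illustration; it does not describe any real system. The one structural point is that a fuzzy name match never produces an automatic denial, only an escalation to a human, and that anyone already cleared on appeal stays cleared.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical screening sketch. The watchlist, cleared list, and threshold
# are all made up; nothing here reflects any real airline or government system.

WATCHLIST = {"john doe", "jane roe"}                  # invented entries
CLEARED_LIST = {("adam example", "2009-01-01")}       # travelers already cleared on appeal

@dataclass
class Decision:
    outcome: str   # "board" or "human_review"
    reason: str

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def screen(name: str, date_of_birth: str) -> Decision:
    name = name.strip().lower()

    # 1. Exceptions first: anyone already cleared on appeal is never re-flagged.
    if (name, date_of_birth) in CLEARED_LIST:
        return Decision("board", "previously cleared on appeal")

    # 2. A name match alone is never grounds for an automatic denial;
    #    it only escalates the case to a human being.
    for entry in WATCHLIST:
        if similarity(name, entry) >= 0.85:
            return Decision("human_review",
                            f"possible match with watchlist entry '{entry}'")

    return Decision("board", "no match")

if __name__ == "__main__":
    print(screen("Adam Example", "2009-01-01"))   # cleared on appeal -> board
    print(screen("John Doe", "1980-01-01"))       # strong match -> human review
    print(screen("Alice Smith", "1990-07-15"))    # no match -> board
```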

…But we trust algorithms over humans, because we believe we’re trusting infallible math, rather than the human who wrote the code that implements the algorithm. We believe in “data-driven decisions,” because we ignore the sad fact that bad data is often actually worse than no data, in very subtle, and yet deeply malicious, ways:

The very fact that we were using historical data meant that we were “training our model” on data that was surely biased, given the history of racism. And since an algorithm cannot see the difference between patterns that are based on injustice and patterns that are based on traffic, choosing race as a characteristic in our model would have been unethical. Looking at old data, we might have seen that families with a black head of household were less likely to get a job, and that might have ended up meaning less job counseling for current black homeless families.

to quote a Slate piece on “The Ethical Data Scientist.” This is just as true of machine learning in general.
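
Here is a minimal sketch of the Slate point, with entirely fabricated numbers: train even the dumbest possible model on historical outcomes that encode an injustice, withhold the sensitive attribute entirely, and the model still reproduces the injustice through a correlated proxy (here, a made-up postal code).

```python
import random
random.seed(0)

# All data here is fabricated to illustrate the Slate quote: historical outcomes
# encode a bias, and a model trained on them reproduces it, even when the
# sensitive attribute itself is withheld, because a proxy (a made-up postal
# code) is correlated with it.

def make_history(n=10_000):
    records = []
    for _ in range(n):
        group = random.choice(["A", "B"])                 # stand-in for a protected class
        postcode = "north" if group == "A" else "south"   # proxy correlated with group
        # Biased history: group B got help far less often, for no legitimate reason.
        helped = random.random() < (0.7 if group == "A" else 0.3)
        records.append((group, postcode, helped))
    return records

history = make_history()

# "Train" the simplest possible model: the historical help rate per postcode.
# Note that it never looks at `group` at all.
def help_rate(records, postcode):
    outcomes = [helped for _, p, helped in records if p == postcode]
    return sum(outcomes) / len(outcomes)

model = {p: help_rate(history, p) for p in ("north", "south")}
print("predicted chance of getting help, by postcode:", model)
# Prints roughly {'north': 0.7, 'south': 0.3}: the old injustice, now automated.
```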

And it is true of neural networks as well, whose categorization decisions are often literally inexplicable; yet, at the same time, one can, relatively easily, trick a neural network into thinking a panda is a vulture, or a cat is a bath towel.
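
The panda-to-vulture trick can be sketched as well. The snippet below is not a real neural network; to stay self-contained it uses a tiny random linear classifier, but the mechanism is the fast-gradient-sign idea behind those attacks: nudge every input feature by a small, fixed amount in the direction that increases the loss, and the predicted class changes.

```python
import numpy as np

# Toy adversarial-example sketch. The "model" is a random linear softmax
# classifier standing in for a trained network; the data is random too.
rng = np.random.default_rng(1)
n_features, n_classes = 64, 3
W = rng.normal(size=(n_classes, n_features))
b = np.zeros(n_classes)

def predict(x):
    return int(np.argmax(W @ x + b))

def fgsm(x, label, eps):
    """One fast-gradient-sign step against the cross-entropy loss for `label`."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()                      # softmax probabilities
    dz = p.copy()
    dz[label] -= 1.0                  # d(loss)/d(logits) for cross-entropy
    dx = W.T @ dz                     # d(loss)/d(input), since the model is linear
    return x + eps * np.sign(dx)      # bounded nudge to every feature

x = rng.normal(size=n_features)
label = predict(x)

# Look for the smallest uniform nudge that changes the model's mind.
for eps in (0.01, 0.05, 0.1, 0.2, 0.5):
    if predict(fgsm(x, label, eps)) != label:
        print(f"prediction flips when every feature moves by just ±{eps}")
        break
else:
    print("no flip at these step sizes; a real attack would keep searching")
```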

And that’s without even considering datasets whose very collection is clearly malicious, or, at least, a bad idea. Consider “the Internet of things that talk about you behind your back.” Consider the “social credit” scores that China plans to assign to every one of its citizens — based in part on who their online and offline friends are — and the ramifications thereof. (And, as a tangential aside, teaching robots how to be deceptive (PDF) just seems like bad strategy for the human race. Just sayin’.)

I’m not saying that collecting more data is bad. Quite the opposite; I believe that as many datasets as possible should be made public, as long as we can be confident that they are sufficiently anonymized. (And I think the notion that scientific data shouldn’t be shared because, and I quote the New England Journal of Medicine here, people might “use the data to try to disprove what the original investigators had posited” deserves all the contempt we can collectively aim at it.)
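
What “sufficiently anonymized” means is itself a judgement call. One common, admittedly crude, yardstick is k-anonymity: every combination of quasi-identifiers in the published data must be shared by at least k records. A toy check over invented rows might look like this.

```python
from collections import Counter

# Invented records: direct identifiers already removed, but quasi-identifiers
# (age band, postcode prefix) remain and can still single people out.
ROWS = [
    {"age_band": "30-39", "postcode": "M5V", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "M5V", "diagnosis": "asthma"},
    {"age_band": "30-39", "postcode": "M5V", "diagnosis": "flu"},
    {"age_band": "60-69", "postcode": "H2X", "diagnosis": "diabetes"},  # unique!
]

QUASI_IDENTIFIERS = ("age_band", "postcode")

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over all quasi-identifier combinations."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

k = k_anonymity(ROWS, QUASI_IDENTIFIERS)
print(f"dataset is {k}-anonymous")   # here: 1-anonymous, i.e. not anonymous at all
if k < 3:
    print("at least one person is effectively identifiable; don't publish this as-is")
```

Even a passing score is only a floor, of course; k-anonymity says nothing about the sensitive column itself, which is why “confident” is doing a lot of work in the paragraph above.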

What I am saying is that most datasets should be anonymized, and when they can’t be, especially when decisions are made based on that data, there should be ample scope for, and easy access to, appeals, error correction, exceptions, and, above all, human judgement. You wouldn’t think this would be a contentious or controversial point of view. But, whether out of intellectual laziness or the belief in data and algorithms as a cheap panacea, it seems to be. Let us please remind ourselves: these are tools, not solutions.