Synthetic data set of human trafficking victims could allow big data work without privacy compromises

To combat human trafficking effectively, those fighting it must understand it, and these days that means data. Unfortunately, for obvious reasons there is no convenient public index of trafficking victims, even though confidential records of individual cases are in some ways abundant. Microsoft and the International Organization for Migration may have found a way forward with a new synthetic data set that has all the important statistical characteristics of the real trafficking data but is completely artificial.

While each victim is unquestionably individual, basic high-level questions like which countries are increasingly the source or means of trafficking, which routes and methods are used, and where the victims end up are a matter of statistics. The evidence to identify trends and patterns, crucial to prevention, is locked up in thousands of these individual stories that most would prefer not to publicize.

“Administrative data on identified cases of human trafficking represent one of the main sources of data available but such information is highly sensitive,” said IOM program coordinator Harry Cook in a news release describing the data set. “IOM has been delighted to work with Microsoft Research over the past two years to make progress on the critical challenge of sharing such data for analysis while protecting the safety and privacy of victims.”

Historically, for things like crime databases and medical info, the strategy has been to redact liberally, but this method of anonymizing has been shown to be ineffective against any serious attempt to reconstruct the data. With numerous databases public or leaked and computing power on tap, the redacted information can often be filled back in quite reliably.

The option taken by Microsoft Research is to use the original data as the basis for a synthetic data set that retains all the important statistical relationships of the source but none of the identifiable information. And it’s not just turning “Jane Doe” into “Janet Doeman” and her hometown from Cleveland to Queens. Instead, groups of no fewer than 10 people with similar or overlapping data are merged to create a set of attributes that accurately represents them statistically but can’t be used to identify them individually.
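To make the idea of a minimum group size concrete, here is a minimal sketch in Python. It is not Microsoft Research's actual method, which is documented on the program's GitHub. It only illustrates the general principle: count how many records share each combination of attributes, release only combinations shared by at least 10 people, and generate artificial records from those counts. The function names, attribute names and example records are all invented for illustration.

```python
from collections import Counter

K = 10  # minimum group size, taken from the article's "no fewer than 10"


def aggregate_with_threshold(records, attributes, k=K):
    """Count records per attribute combination, dropping groups smaller than k."""
    counts = Counter(tuple(r[a] for a in attributes) for r in records)
    return {combo: n for combo, n in counts.items() if n >= k}


def synthesize(released_counts, attributes):
    """Emit one artificial record per counted individual in each released group."""
    synthetic = []
    for combo, n in released_counts.items():
        synthetic.extend(dict(zip(attributes, combo)) for _ in range(n))
    return synthetic


# Entirely made-up example records (hypothetical attributes and values).
records = (
    [{"origin": "A", "destination": "B", "age_band": "18-29"}] * 12
    + [{"origin": "A", "destination": "C", "age_band": "30-39"}] * 3
)
attributes = ["origin", "destination", "age_band"]

released = aggregate_with_threshold(records, attributes)
synthetic = synthesize(released, attributes)
```

In this toy example the group of 12 survives and the group of 3 is suppressed, so no synthetic record reflects a combination of attributes that fewer than 10 real people share. A real system would also need to handle partial attribute combinations and preserve statistics across overlapping groups, which this sketch does not attempt.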

Caption: Statistics relating to human trafficking around the world.

Image Credits: Microsoft Research / IOM

Naturally this doesn’t have the granularity of the original data, but unlike the sensitive source, this data can actually be used. It’s not necessarily for some task force to analyze and say “okay, the next smuggling operation will be based out of…” but rather this data, based on firsthand evidence, can be pointed to as a factual record for addressing trafficking at a policy and diplomacy level. Where before one may have had to say in a more general way that Country X or Government Z was neglectful or complicit in these matters, having hard data to back that up allows one to say “36 percent of sex trafficking victims pass through your jurisdiction.”

Not that the data has to be used in strong-arm tactics: simply understanding the global trade in human misery as a system and not just a series of disconnected events is valuable in and of itself. You can peruse the data and request to use it here, and learn more about the process for creating it at the program’s GitHub.