Stanford quantifies the privacy-stripping power of metadata

More proof, if proof were needed, of the privacy-stripping power of metadata. A multi-year crowdsourced study, conducted by Stanford scientists and published this week, underlines how much information can be inferred from basic phone logs cross-referenced with other public datasets.

(Reminder: the former director of the NSA and the CIA, General Michael Hayden, has asserted: “We kill people based on metadata” — which suggests rock-solid confidence in the inferences that spy agencies are able to draw from metadata. Hayden reiterated this point in an on-stage interview last week at TechCrunch Disrupt New York. “Metadata’s incredibly powerful,” he said. “Metadata shouldn’t get a free pass.”)

The research paper, entitled “Evaluating the privacy properties of telephone metadata”, details how the scientists investigated what they describe as the “factual assumptions that undergird policies of differential treatment for content and metadata”, and shows how readily they were able to generate detailed intelligence from that metadata.

Their study is based on crowdsourced telephone metadata from more than 800 volunteers (using an Android app to pull the relevant metadata off the participants’ phones) cross-referenced with social networking information and other public data sets, such as Yelp and Google Places.

“We find that telephone metadata is densely interconnected, susceptible to reidentification, and enables highly sensitive inferences,” the paper authors write.

State surveillance activity typically involves a legal distinction being made between accessing the contents of communications vs harvesting ‘only’ the communications metadata, with tighter restrictions applied to the former than to the latter. However, given how much can be inferred from metadata, there is a growing case for more stringent controls on how metadata can be used, and on how wide the net should be cast.

The UK government-appointed independent reviewer of terrorism legislation, David Anderson, noted in his review of investigatory powers last year that “the distinction between ‘content data’ and metadata… is rapidly fading away in modern network environment” — quoting that conclusion from a prior EU-funded Surveille report.

Meanwhile, in the US, following the 2013 Snowden disclosures, NSA analysts who were previously allowed to look at data up to three hops away from a target individual — where one hop might be a phone call from the individual’s phone to another number — are now restricted to two hops. It has also been proposed that the data retention window be shrunk from five years to 18 months.

“Over a large sample of telephone subscribers, over a lengthy period, it is inevitable that some individuals will expose deeply sensitive information. It follows that large-scale metadata surveillance programs, like the NSA’s, will necessarily expose highly confidential information about ordinary citizens,” the Stanford scientists argue in their paper.

In their estimation, the reach of the NSA’s metadata surveillance program prior to 2013 (when analysts were able to perform three hops) would have given the agency “legal authority to access telephone records for the majority of the entire US population”.

“Under the more recent two-hop rule, the proposed 18-month retention period, and an assumption that national and local hub numbers are removed from the call graph, an analyst could in expectation access records for ∼25,000 subscribers with a single seed,” they add.
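
To make the hop arithmetic concrete, here is a minimal sketch in Python of how an analyst’s reachable set grows with each hop over a call graph. The toy graph and function names are our own invention for illustration, not anything from the paper:

```python
from collections import defaultdict

def build_call_graph(call_records):
    """Build an undirected call graph: each record is a (caller, callee) pair."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph

def reachable_within_hops(graph, seed, max_hops):
    """Return all numbers within max_hops of the seed (breadth-first search)."""
    frontier = {seed}
    reached = {seed}
    for _ in range(max_hops):
        frontier = {n for node in frontier for n in graph[node]} - reached
        reached |= frontier
    return reached

# Toy example: each number here has only a handful of contacts; real-world
# fan-out is what lets two hops balloon to the ~25,000 subscribers the
# researchers estimate for a single seed.
records = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E"), ("E", "F")]
graph = build_call_graph(records)
print(reachable_within_hops(graph, "A", 2))  # {'A', 'B', 'C', 'E', 'F'}; 'D' is 3 hops out
```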

The study underlines quite how much can be inferred when you are harvesting metadata from even a relatively small group of people. The metadata gathered on the 823 volunteer study participants covered around 250,000 calls and more than 1.2 million texts — clearly a drop in the ocean compared with the mass surveillance programs operated by state security agencies, yet the researchers were still able to glean a great deal of information.

For example, the researchers found it was trivially easy to reidentify a person whose name they did not know if they had the person’s telephone number. “We conducted both automated and manual attempts at reidentification, and we found that both approaches were highly successful,” they write.
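
The basic mechanics of the automated approach (matching numbers in the dataset against numbers attached to public listings and profiles) can be sketched as follows. The lookup tables and names below are illustrative stand-ins, not the paper’s actual pipeline:

```python
def normalize(number):
    """Reduce a phone number to its digits so different formats match up."""
    return "".join(ch for ch in number if ch.isdigit())

def reidentify(phone_number, public_sources):
    """Try to attach a name to a number by checking each scraped public
    source (e.g. Yelp, Google Places, social-network profiles) in turn.

    public_sources: list of (source_name, {normalized_number: name}) pairs.
    Returns a list of (source_name, name) candidate matches.
    """
    target = normalize(phone_number)
    return [(source, table[target])
            for source, table in public_sources
            if target in table]

# Invented example data: the study matched real numbers against real
# public listings; these entries are purely illustrative.
sources = [("yelp", {"4155550198": "Example Pizza Co."}),
           ("facebook", {"4155550123": "Jane Example"})]
print(reidentify("(415) 555-0123", sources))  # [('facebook', 'Jane Example')]
```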

They were also able to predict location, based on the location of businesses people telephoned (and using public data sources to match businesses with phone numbers) — correctly predicting the “Facebook current city” of a majority (57%) of the study participants.
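
That inference can be sketched as a majority vote over the cities of the businesses a participant called, using a number-to-city table of the kind that can be scraped from Yelp or Google Places. The data and function names here are invented for illustration:

```python
from collections import Counter

def predict_home_city(called_numbers, business_directory):
    """Guess a subscriber's city by majority vote over the cities of
    the businesses they called.

    business_directory: dict mapping phone numbers to a city string,
    as could be scraped from public listings.
    """
    cities = [business_directory[n] for n in called_numbers
              if n in business_directory]
    if not cities:
        return None
    return Counter(cities).most_common(1)[0][0]

calls = ["6505550101", "6505550199", "2125550123"]
directory = {"6505550101": "Palo Alto, CA",
             "6505550199": "Palo Alto, CA",
             "2125550123": "New York, NY"}
print(predict_home_city(calls, directory))  # Palo Alto, CA
```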

The researchers also built a classifier to determine whether someone was in a relationship, based on their call and text records; once they had labeled a person as being in a relationship, identifying their partner was “trivial” — again from the metadata.

“It appears feasible — with further refinement — to draw Facebook-quality relationship inferences from telephone metadata,” they write.
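
The paper treats the partner-identification step as trivial once relationship status is known; one plausible heuristic (ours, for illustration, not necessarily the researchers’ exact method) is simply to pick the contact with the highest combined call and text volume:

```python
from collections import Counter

def likely_partner(events):
    """Pick the most likely partner: the contact with the highest combined
    call-and-text volume in the subscriber's metadata.

    events: list of (contact_number, kind) tuples, kind in {"call", "text"}.
    A plausible heuristic, not the paper's actual classifier.
    """
    volume = Counter(number for number, _ in events)
    return volume.most_common(1)[0][0] if events else None

events = [("555-0101", "text")] * 40 + [("555-0102", "call")] * 5
print(likely_partner(events))  # 555-0101
```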

They also found they could draw even more highly sensitive inferences from the metadata — connecting the dots from a series of phone calls to infer, for example, that one participant might have multiple sclerosis, that another might have a specific heart condition, that a third might be involved in growing cannabis, that a fourth might own a semiautomatic rifle, and that a fifth might be pregnant.

“Using public sources, we were able to confirm that participant B had a cardiac arrhythmia and participant C owned an AR rifle. As for the remaining inferences, regardless of whether they were accurate, the mere appearance of possessing a highly sensitive trait assuredly constitutes a serious privacy impact.”

Summing up their findings, they argue there are “significant privacy impacts” from using telephone metadata for surveillance purposes — and call for law and policy in this area to be underpinned by quantitative and scientific analysis, rather than “assumption and conventional wisdom”.

They write:

Telephone metadata is densely interconnected, easily reidentifiable, and trivially gives rise to location, relationship, and sensitive inferences. In combination with independent reviews that have found bulk metadata surveillance to be an ineffective intelligence strategy, our findings should give policymakers pause when authorizing such programs. More broadly, this project emphasizes the need for scientifically rigorous surveillance regulation. Much of the law and policy that we explored in this research was informed by assumption and conventional wisdom, not quantitative analysis. To strike an appropriate balance between national security and civil liberties, future policymaking must be informed by input from the relevant sciences.

The scientists also flag up privacy concerns about commercial data practices, noting it is routine practice for telecommunications firms to “collect, retain, and transfer subscriber telephone records” — arguing that telecoms regulations should therefore also incorporate “a scientifically rigorous understanding of the privacy properties of these data”.