Amazon And The NIH Team Up To Put Human Genome In The Cloud

Amazon and the U.S. National Institutes of Health (NIH) announced today that the complete 1000 Genomes Project is being made available on Amazon Web Services as a public data set. The move, announced at the White House Big Data Summit, makes the largest collection of human genetic data available to anyone free of charge.

In case you’re light on the details, the 1000 Genomes Project is an international research effort started in 2008 that involves 75 companies and organizations working together to create a detailed catalog of the human genome and all of its 3 billion DNA bases. The project has generated over 200 terabytes of data to date.

DNA has now been sequenced from over 2,661 individuals from 26 populations, and the NIH plans to add more samples this year. The effort led to the techniques used to sequence the DNA of other species, ranging from the mouse to the gorilla.

The project started off with three pilot studies. Amazon began hosting the initial pilot data on Amazon S3 in 2010, so it’s not surprising to see the remainder of the data added today. The latest dataset contains the DNA of 1,700 people.

Putting the data on Amazon Web Services is meant to speed up access to the research. Previously, researchers had to download the data from government data centers or their own systems, or even receive it on discs by snail mail.

The data will be stored on Amazon Simple Storage Service (Amazon S3) and Amazon Elastic Block Store (Amazon EBS) and can be accessed from AWS services such as Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic MapReduce (Amazon EMR).
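For readers curious what access to a public S3 data set can look like in practice, here is a minimal Python sketch that lists a few object keys anonymously through the S3 REST listing API, using only the standard library. The article does not name the bucket, so the bucket name `1000genomes` and the prefix in the usage note are assumptions for illustration, and the helper names (`listing_url`, `parse_keys`, `list_keys`) are hypothetical rather than part of any AWS tool.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Assumption: the article does not give a bucket name; this one is illustrative.
BUCKET = "1000genomes"

# XML namespace used by S3 bucket-listing responses.
S3_NS = "{http://s3.amazonaws.com/doc/2006-03-01/}"


def listing_url(bucket, prefix="", max_keys=10):
    """Build an unsigned ListObjectsV2 URL for a publicly listable bucket."""
    query = urlencode({"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)})
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_keys(xml_text):
    """Extract the object keys from a ListBucketResult XML document."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{S3_NS}Key")]


def list_keys(bucket=BUCKET, prefix="", max_keys=10):
    """Fetch one page of keys. Needs network access and a bucket that
    allows anonymous listing; nothing here is signed or authenticated."""
    with urlopen(listing_url(bucket, prefix, max_keys)) as resp:
        return parse_keys(resp.read().decode("utf-8"))
```

Calling `list_keys(prefix="some/prefix/")` would return up to ten keys under that (hypothetical) prefix. From an EC2 instance or an EMR job, the same objects would more typically be read with the AWS SDK or command-line tools rather than raw HTTP requests.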

The 1000 Genomes Project is just one of many publicly hosted datasets found on Amazon. Others include data from NASA’s Jet Propulsion Laboratory, Langone Medical Center at New York University, Unilever, Numerate, Sage Bionetworks and Ion Flux.

More details on the data itself are here.