Using data science to beat cancer

The complexity of seeking a cure for cancer has vexed researchers for decades. While they’ve made remarkable progress, they are still waging a battle uphill as cancer remains one of the leading causes of death worldwide.

Yet scientists may soon have a critical new ally at their sides —  intelligent machines — that can attack that complexity in a different way.

Consider an example from the world of gaming: Last year, Google’s artificial intelligence platform, AlphaGo, deployed techniques in deep learning to beat South Korea Grand Master Lee Sedol in the immensely complex game of Go, which has more moves than there are stars in the universe.

Those same techniques of machine learning and AI can be brought to bear in the massive scientific puzzle of cancer.

One thing is certain — we won’t have a shot at conquering cancer with these new methods if we don’t have more data to work with. Many data sets, including medical records, genetic tests and mammograms, for example, are locked up and out of reach of our best scientific minds and our best learning algorithms.

The good news is that big data’s role in cancer research is now at center stage, and a number of large-scale, government-led sequencing initiatives are moving forward. Those include the U.S. Department of Veteran Affairs’ Million Veteran Program; the 100,000 Genomes Project in the U.K.; and the NIH’s The Cancer Genome Atlas, which holds data from more than 11,000 patients and is open to researchers everywhere to analyze via the cloud. According to a recent study, as many as 2 billion human genomes could be sequenced by 2025.

There are other trends driving demand for fresh data, including genetic testing. In 2007, sequencing one person’s genome cost $10 million. Today you can get this done for less than $1,000. In other words, for every person sequenced 10 years ago, we can now do 10,000. The implications are big: Discovering that you have a mutation linked to higher risk of certain types of cancer can sometimes be a life-saving bit of information. And as costs approach mass affordability, research efforts approach massive potential scale.

A central challenge for researchers (and society) is that current data sets lack both volume and ethnic diversity. In addition, researchers often face restrictive legal terms and reluctant sharing partnerships. Even when organizations share genomic data sets, the agreements are typically between individual institutions for individual data sets. While there are larger clearinghouses and databases operating today that have done great work, we need more work on standardized terms and platforms to accelerate access.

The potential benefits of these new technologies go beyond identifying risk and screening. Advances in machine learning can help accelerate cancer drug development and therapy selections, enabling doctors to match patients with clinical trials, and improving their abilities to provide custom treatment plans for cancer patients (Herceptin, one of the earliest examples, remains one of the best).

We believe three things need to happen to make data more available for use for cancer research and AI programs. First, patients should be able to contribute data easily. This includes medical records, radiology images and genetic testing. Laboratory companies and medical centers should adopt a common consent form to make it easy and legal for data sharing to occur. Second, more funding is needed for researchers working at the intersection of AI, data science and cancer. Just as the Chan Zuckerberg Foundation is funding new tool development for medicine, new AI techniques need to be funded for medical applications. Third, new data sets should be generated, focused on people of all ethnicities. We need to make sure that advances in cancer research are accessible to all.