Researchers match DeepMind's AlphaFold2 protein folding power with faster, freely available model

DeepMind stunned the biology world late last year when its AlphaFold2 AI model predicted the structure of proteins (a common and very difficult problem) so accurately that many declared the decades-old problem “solved.” Now researchers claim to have leapfrogged DeepMind the way DeepMind leapfrogged the rest of the world, with RoseTTAFold, a system that does nearly the same thing at a fraction of the computational cost. (Oh, and it’s free to use.)

AlphaFold2 has been the talk of the industry since November, when it blew away the competition at CASP14, a virtual competition between algorithms built to predict the physical structure of a protein given the sequence of amino acids that make it up. The model from DeepMind was so far ahead of the others, so highly and reliably accurate, that many in the field have talked (half-seriously and in good humor) about moving on to a new field.

But one aspect that seemed to satisfy no one was DeepMind’s plans for the system. The model was not exhaustively and openly described, and some worried that the company (which is owned by Alphabet/Google) was planning on more or less keeping the secret sauce to itself, which would be its prerogative but also somewhat against the ethos of mutual aid in the scientific world.

Update: In something of a surprise, DeepMind published more detailed methods in the journal Nature today. The code is available on GitHub. This does considerably lessen the aforementioned concern, but the advance described below is still highly relevant. I’ve also added a comment from that team at the bottom of the article.

That concern seems to have been at least partly mooted by work from University of Washington researchers led by David Baker and Minkyung Baek, published in the latest issue of the journal Science. Baker, you may remember, recently won a Breakthrough Prize for his team’s work combating COVID-19 with engineered proteins.

The team’s new model, RoseTTAFold, makes predictions at similar accuracy levels using methods that Baker, responding to questions via email, candidly admitted were inspired by those used by AlphaFold2.

“The AlphaFold2 group presented several new high-level concepts at the CASP14 meeting. Starting from these ideas, and with a lot of collective brainstorming with colleagues in the group, Minkyung has been able to make amazing progress in very little time,” he said. (“She is amazing!” he added.)

Examples of predicted protein structures and their ground truths. A score above 90 is considered extremely good. Image Credits: UW/Baek et al

Baker’s group more or less placed second at CASP14, no mean feat, but hearing DeepMind’s methods described even generally set them on a collision course. They developed a “three-track” neural network that simultaneously considered the amino acid sequence (one dimension), distances between residues (two dimensions) and coordinates in space (three dimensions). The implementation is beyond complex and far outside the scope of this article, but the result is a model that achieves almost the same accuracy levels — levels, it bears repeating, that were completely unprecedented less than a year ago.
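For the curious, the three kinds of information the network juggles can be pictured with a toy example. The sketch below is not RoseTTAFold's code and involves no neural network; the sequence, the coordinates and the helper names (`one_hot_sequence`, `distance_map`) are all made up for illustration. It only shows the shape of each "track": one feature vector per residue, one distance per pair of residues, and one point in space per residue.

```python
import math

# The 20 standard amino acids, by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_sequence(seq):
    """1D track: one feature vector per residue (L rows of 20 values)."""
    return [[1 if aa == ref else 0 for ref in AMINO_ACIDS] for aa in seq]


def distance_map(coords):
    """2D track: an L x L matrix of distances between every pair of residues."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Hypothetical toy input: a five-residue sequence and invented coordinates
# (one point per residue, in angstroms). This is not real structural data.
sequence = "MKVLA"
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (3.8, 3.8, 0.0),
    (3.8, 3.8, 3.8),
    (0.0, 3.8, 3.8),
]

seq_track = one_hot_sequence(sequence)  # L x 20
pair_track = distance_map(coords)       # L x L
coord_track = coords                    # L x 3

for name, track in [("1D", seq_track), ("2D", pair_track), ("3D", coord_track)]:
    print(name, len(track), "x", len(track[0]))
```

In a real predictor only the sequence is known up front; the distances and coordinates are what the model has to work out. Here they are simply supplied so that all three representations can be seen side by side.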

What’s more, RoseTTAFold accomplishes this level of accuracy far more quickly, which is to say with far less computing power. As the paper puts it:

DeepMind reported using several GPUs for days to make individual predictions, whereas our predictions are made in a single pass through the network in the same manner that would be used for a server…the end-to-end version of RoseTTAFold requires ~10 min on an RTX2080 GPU to generate backbone coordinates for proteins with less than 400 residues.

Hear that? It’s the sound of thousands of microbiologists sighing in relief and discarding drafts of emails asking for supercomputer time. It may not be easy to lay one’s hands on a 2080 these days, but the point is any high-end desktop GPU can perform this task in minutes, instead of requiring a high-end cluster running for days.

The modest requirements make RoseTTAFold suitable for public hosting and distribution as well, something that might never have been in the cards for AlphaFold2.

“We have a public server that anyone can submit protein sequences to and have the structures predicted,” Baker said. “There have been over 4,500 submissions since we put the server up a few weeks ago. We have also made the source code freely available.”

This may seem very niche, and it is, but protein folding has historically been one of the toughest problems in biology and one toward which countless hours of high-performance computing have been dedicated. You may recall Folding@Home, the popular distributed computing app that let people donate their computing cycles to attempting to predict protein structures. The kind of problem that might have taken a thousand computers days or weeks to do — essentially by brute-forcing solutions and checking for fit — now can be done in minutes on a single desktop.

The physical structure of proteins is of utmost importance in biology, as it is proteins that do the vast majority of tasks in our bodies, and proteins that must be modified, suppressed, enhanced and so on for therapeutic reasons; first, however, they need to be understood, and until November that understanding could not be reliably achieved computationally. At CASP14 it was proven to be possible, and now it has been made widely available.

It is not, by a long shot, a “solution” to the problem of protein folding, though the sentiment has been expressed. Most proteins at rest in neutral conditions can now have their structure predicted, and that has huge repercussions in multiple domains, but proteins are seldom found “at rest in neutral conditions.” They twist and contort to grab or release other molecules, to block or slip through gates and other proteins, and generally to do everything they do. These interactions are far more numerous, complex and difficult to predict, and neither AlphaFold2 nor RoseTTAFold can do so.

“There are many exciting chapters ahead… the story is just beginning,” said Baker.

Regarding the DeepMind paper, Baker offered the following comment in the spirit of collegial camaraderie:

I’ve read through, and think this is a beautiful paper describing fantastic work.

The DeepMind paper is actually very complementary to our paper, and I think it is appropriate that it is not coming out after ours, as our work is really based on their advances.

I think that readers will enjoy reading both papers — they are very far from being duplicative. As we point out in our paper, their method is more accurate than ours, and now it will be very interesting to see what features of their approach are responsible for the remaining differences. We are already using RoseTTAFold for protein design and more systematic protein-protein complex structure prediction, and we are excited about rapidly improving these, along with traditional single chain modeling, by incorporating ideas from the DeepMind paper.

Another late addition comes from DeepMind, which, upon reading through the Baker Lab paper, wanted to point out that the accuracy difference is not trivial and that the performance gap has been closed somewhat as well. I’ve asked for a bit of clarification on this, but as you can no doubt see, this is a fast-moving area of research, so much so that even the leading labs can’t keep track of each other.

Your best bet to stay up to date, if you’re curious about the science and the potential repercussions, is reading this much more detailed and technical account of the methods and possible next steps written in the wake of AlphaFold2’s CASP14 performance. The experts cited there will have much more insight.
