Protein programmers get a helping hand from Cradle’s generative AI

Proteins are the molecules that get work done in nature, and there’s a whole industry emerging around successfully modifying and manufacturing them for various uses. But doing so is time-consuming and haphazard; Cradle aims to change that with an AI-powered tool that tells scientists what new structures and sequences will make a protein do what they want it to. The company emerged from stealth today with a substantial seed round.

AI and proteins have been in the news lately, but largely because of the efforts of research outfits like DeepMind and Baker Lab. Their machine learning models take in easily collected amino acid sequence data and predict the structure a protein will take, a step that used to take weeks and require expensive special equipment.

But as incredible as that capability is in some domains, it’s just the starting point for others. Modifying a protein to be more stable or bind to a certain other molecule involves much more than just understanding its general shape and size.

“If you’re a protein engineer, and you want to design a certain property or function into a protein, just knowing what it looks like doesn’t help you. It’s like, if you have a picture of a bridge, that doesn’t tell you whether it’ll fall down or not,” explained Cradle CEO and co-founder Stef van Grieken.

“AlphaFold takes a sequence and predicts what the protein will look like,” he continued. “We’re the generative brother of that: You pick the properties you want to engineer, and the model will generate sequences you can test in your laboratory.”

Predicting what proteins — especially ones new to science — will do in situ is a difficult task for lots of reasons, but in the context of machine learning the biggest issue is that there isn’t enough data available. So Cradle originated much of its own dataset in a wet lab, testing protein after protein and seeing what changes in their sequences seemed to lead to which effects.

Interestingly, the model itself is not exactly biotech-specific but a derivative of the same “large language models” that have produced text production engines like GPT-3. Van Grieken noted that these models are not limited strictly to language in how they understand and predict data, an interesting “generalization” characteristic that researchers are still exploring.

Examples of the Cradle UI in action. Image Credits: Cradle

The protein sequences Cradle ingests and predicts are not in any language we know, of course, but they are relatively straightforward linear sequences of text that have associated meanings. “It’s like an alien programming language,” van Grieken said.
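
To make the "linear sequences of text" idea concrete, here is a minimal sketch of how a protein sequence might be turned into the integer tokens a language model consumes. The 20-letter alphabet is the standard one-letter amino acid code; the mapping and the function names are assumptions for illustration and say nothing about how Cradle actually encodes its data.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Hypothetical vocabulary: each residue letter gets its index in the alphabet.
TOKEN_ID = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    """Map a protein sequence to a list of integer token ids."""
    try:
        return [TOKEN_ID[aa] for aa in seq.upper()]
    except KeyError as err:
        raise ValueError(f"unknown residue: {err.args[0]}") from None

def decode(ids):
    """Map a list of token ids back to the sequence string."""
    return "".join(AMINO_ACIDS[i] for i in ids)

# A made-up eight-residue sequence, not a real protein.
print(encode("MKTAYIAK"))
```

Once sequences are token lists like this, the same machinery used for words in a sentence can, in principle, be applied to residues in a protein.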

Protein engineers aren’t helpless, of course, but their work necessarily involves a lot of guessing. One may be fairly certain that among the 100 sequences they’re modifying is the combination that will produce the desired effect, but beyond that it comes down to exhaustive testing. A bit of a hint here could speed things up considerably and avoid a huge amount of fruitless labor.

The model works in three basic layers, he explained. First it assesses whether a given sequence is “natural,” i.e., whether it is a meaningful sequence of amino acids or just random ones. This is akin to a language model just being able to say with 99% confidence that a sentence is in English (or Swedish, in van Grieken’s example), and the words are in the correct order. This it knows from “reading” millions of such sequences determined by lab analysis.
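
The "does this look natural" step can be illustrated with something far simpler than a large language model. The sketch below swaps in a toy bigram model: it counts which residue tends to follow which in a handful of training sequences, then scores a new sequence by its average log-probability. The training sequences are made up, and the bigram approach is a stand-in chosen for brevity, not Cradle's method.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Made-up training sequences, for illustration only (not real proteins).
TRAINING = ["MKTAYIAKQR", "MKTAYLAKQR", "MKSAYIAKQK"]

def train_bigram(sequences):
    """Count adjacent residue pairs across the training sequences."""
    pair_counts = Counter()
    first_counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts

def naturalness(seq, model):
    """Average log-probability per transition, with add-one smoothing."""
    if len(seq) < 2:
        raise ValueError("need at least two residues to score")
    pair_counts, first_counts = model
    total = 0.0
    for a, b in zip(seq, seq[1:]):
        p = (pair_counts[(a, b)] + 1) / (first_counts[a] + len(AMINO_ACIDS))
        total += math.log(p)
    return total / (len(seq) - 1)

model = train_bigram(TRAINING)
print(naturalness("MKTAYIAKQR", model))  # a sequence seen in training
print(naturalness("WWWWWWWWWW", model))  # a repetitive sequence never seen in training
```

A higher (less negative) score means the sequence resembles the training data more closely. A real system would learn far longer-range patterns from millions of sequences, but the principle of scoring by likelihood is the same.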

Next it looks at the actual or potential meaning in the protein’s alien language. “Imagine we give you a sequence, and this is the temperature at which this sequence will fall apart,” he said. “If you do that for a lot of sequences, you can say not just, ‘this looks natural,’ but ‘this looks like 26 degrees Celsius.’ That helps the model figure out what regions of the protein to focus on.”
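
As a rough illustration of that second layer, the sketch below takes a few sequences labeled with a temperature, predicts a temperature for a new sequence from its nearest labeled neighbors, and estimates which positions matter most. Everything here is an assumption for the sake of the example: the sequences and temperatures are invented, and nearest-neighbor averaging stands in for whatever Cradle's model does internally.

```python
from collections import defaultdict
from statistics import mean

# Made-up (sequence, melting temperature in °C) pairs, for illustration only.
LABELED = [
    ("MKTAYIAK", 26.0),
    ("MKTAYLAK", 31.0),
    ("MRTAYIAK", 26.5),
    ("MRTAYLAK", 30.5),
]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def predict_temperature(seq, labeled, k=2):
    """Average the labels of the k labeled sequences closest to seq."""
    nearest = sorted(labeled, key=lambda item: hamming(seq, item[0]))[:k]
    return mean(temp for _, temp in nearest)

def position_sensitivity(labeled):
    """Per position: spread of the mean temperature across residues seen there."""
    length = len(labeled[0][0])
    spreads = []
    for i in range(length):
        by_residue = defaultdict(list)
        for seq, temp in labeled:
            by_residue[seq[i]].append(temp)
        means = [mean(temps) for temps in by_residue.values()]
        spreads.append(max(means) - min(means))
    return spreads

spreads = position_sensitivity(LABELED)
print(spreads.index(max(spreads)))  # index of the position with the largest effect
```

In this toy data, swapping the residue at one position moves the temperature by several degrees while swapping another barely matters, which is the kind of signal that would tell a model where to focus.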

The model can then suggest sequences to slot in — educated guesses, essentially, but a stronger starting point than scratch. The engineer or lab can then try them and bring that data back to the Cradle platform, where it can be re-ingested and used to fine-tune the model for the situation.
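
The suggest, test and feed-back cycle described above can be sketched as a simple loop. In this sketch the candidates are limited to single-residue changes, `score` stands in for the model's prediction, and `measure` stands in for a lab assay; all three are hypothetical simplifications, and the actual fine-tuning step is not shown.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                yield seq[:i] + aa + seq[i + 1:]

def suggest(seq, score, n=5):
    """Return the n highest-scoring single mutants under the score function."""
    return sorted(single_mutants(seq), key=score, reverse=True)[:n]

def design_round(seq, score, measure, labeled, n=5):
    """One round: suggest candidates, measure them, and record the results."""
    candidates = suggest(seq, score, n)
    results = [(c, measure(c)) for c in candidates]
    labeled.extend(results)  # this growing dataset is what a model would be refit on
    return max(results, key=lambda r: r[1])

# Toy stand-ins: a score that favors leucine and a fake "assay" that agrees with it.
toy_score = lambda s: s.count("L")
toy_measure = lambda s: 20.0 + 5.0 * s.count("L")

labeled = []
best = design_round("MKT", toy_score, toy_measure, labeled, n=3)
print(best, len(labeled))
```

The point of the loop is that each round leaves behind more labeled data than the last, so the next set of suggestions can be better informed than the first.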

The Cradle team on a nice day at their HQ (van Grieken is center). Image Credits: Cradle

Modifying proteins for various purposes is useful across biotech, from drug design to biomanufacturing, and the path from vanilla molecule to customized, effective and efficient molecule can be long and expensive. Any way to shorten it will likely be welcomed by, at the very least, the lab techs who have to run hundreds of experiments just to get one good result.

Cradle has been operating in stealth and is now emerging having raised $5.5 million in a seed round co-led by Index Ventures and Kindred Capital, with participation from angels John Zimmer, Feike Sijbesma and Emily Leproust.

Van Grieken said the funding would allow the team to scale up data collection — the more the better when it comes to machine learning — and work on the product to make it “more self-service.”

“Our goal is to reduce the cost and time of getting a bio-based product to market by an order of magnitude,” said van Grieken in the press release, “so that anyone — even ‘two kids in their garage’ — can bring a bio-based product to market.”