Researchers use generative AI to design new proteins
(Nanowerk News) Researchers at the University of Toronto have developed an artificial intelligence system that can make proteins not found in nature using generative diffusion, the same technology behind popular image generation platforms such as DALL-E and Midjourney.
The system will help advance the field of generative biology, which holds promise for accelerating drug development by making the design and testing of new therapeutic proteins more efficient and flexible.
“Our model learns from image representations to generate entirely new proteins, at a very high rate,” said Philip M. Kim, a professor in the Donnelly Centre for Cellular and Biomolecular Research at U of T’s Temerty Faculty of Medicine. “All of our proteins appear to be biophysically real, meaning they fold into configurations that allow them to carry out specific functions in the cell.”
The journal Nature Computational Science (“Score-based generative modeling for de novo protein design”) published the findings, the first of their kind in a peer-reviewed journal. Kim’s lab also published a preprint on the model last summer via the open-access server bioRxiv, ahead of two similar preprints from last December: RF Diffusion by the University of Washington and Chroma by Generate Biomedicines.
Proteins are made of chains of amino acids that fold into three-dimensional shapes, which in turn determine each protein’s function. These shapes evolved over billions of years and are varied and complex, but they are also limited in number. With a better understanding of how existing proteins fold, researchers have begun to design folding patterns that are not produced in nature.
But the main challenge, says Kim, is imagining folds that are both possible and functional. “It is very difficult to predict which folds will be real and work in a protein structure,” said Kim, who is also a professor in the departments of molecular genetics and computer science at U of T. “[...] image, we can begin to deal with this problem.”
The new system, which the researchers call ProteinSGM, draws from a large set of image-like representations of existing proteins that accurately encode their structure. The researchers feed these images into a generative diffusion model, which gradually adds noise until each image is pure noise. The model tracks how the images become noisier and then runs the process in reverse, learning how to turn random pixels into clear images that correspond to entirely new proteins.
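As a rough illustration of the noising step described above, the sketch below uses a pairwise distance map as a stand-in for the image-like representation and a DDPM-style linear noise schedule. Both are assumptions made for illustration only: the article does not specify ProteinSGM’s actual encoding or noise schedule, and every function name and parameter value here is hypothetical. The learned reverse (denoising) process is not shown.

```python
import numpy as np

def distance_map(coords):
    """Pairwise distances between points: one simple image-like encoding (assumed)."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def noise_schedule(num_steps=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative signal fraction per step for a linear schedule (assumed values)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(image, step, alpha_bar, rng):
    """Jump directly to a given noise level: scaled image plus scaled Gaussian noise."""
    eps = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar[step]) * image + np.sqrt(1.0 - alpha_bar[step]) * eps

rng = np.random.default_rng(0)

# Toy random-walk "backbone" of 32 points; not real protein coordinates.
coords = np.cumsum(rng.standard_normal((32, 3)), axis=0)
image = distance_map(coords)
image = image / image.max()  # scale values into [0, 1]

alpha_bar = noise_schedule()
slightly_noisy = add_noise(image, 10, alpha_bar, rng)   # still resembles the input
mostly_noise = add_noise(image, 999, alpha_bar, rng)    # almost no signal left
```

A generative model of this kind would be trained to undo such noising one step at a time, so that starting from random values it can arrive at a new, structured image.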
Jin Sub (Michael) Lee, a doctoral student in Kim’s lab and first author of the paper, said optimizing this early stage of the image generation process was one of the biggest challenges in creating ProteinSGM. “The key idea is the precise image-like representation of protein structures, so that diffusion models can learn how to generate new proteins accurately,” said Lee, who is originally from Vancouver but completed a bachelor’s degree in South Korea and a master’s degree in Switzerland before choosing U of T for his doctorate.
Also difficult was validating the proteins produced by ProteinSGM. The system produces many structures, often unlike anything found in nature. Nearly all of them look real according to standard metrics, says Lee, but the researchers needed more evidence.
To test their new proteins, Lee and his colleagues first turned to OmegaFold, a protein structure prediction tool similar to DeepMind’s AlphaFold 2. Both platforms use AI to predict protein structure from an amino acid sequence.
With OmegaFold, the team confirmed that almost all of their new sequences fold into the desired and novel protein structures. They then chose a smaller number to physically create in test tubes, to make sure the structures were proteins and not just stray strings of chemical compounds.
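One common way to check whether a predicted structure matches a designed one is to superimpose the two sets of coordinates and measure the remaining deviation. The sketch below computes the root-mean-square deviation (RMSD) after Kabsch alignment. This is an assumption for illustration: the article does not say which comparison metric the team used, the coordinates are synthetic, and no structure prediction software is called.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid superposition."""
    P0 = P - P.mean(axis=0)  # remove translation
    Q0 = Q - Q.mean(axis=0)
    H = P0.T @ Q0            # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = P0 @ R.T
    return float(np.sqrt(np.mean(np.sum((P_aligned - Q0) ** 2, axis=1))))

rng = np.random.default_rng(0)

# Synthetic stand-in for a designed backbone (hypothetical data).
designed = np.cumsum(rng.standard_normal((50, 3)), axis=0)

# "Predicted" structure: the same shape, rotated about z and shifted.
angle = 0.7
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0,            0.0,           1.0]])
predicted_same = designed @ rot_z.T + np.array([5.0, -2.0, 1.0])

# A second prediction with small coordinate errors added.
predicted_perturbed = predicted_same + 0.1 * rng.standard_normal(designed.shape)

rmsd_same = kabsch_rmsd(designed, predicted_same)
rmsd_perturbed = kabsch_rmsd(designed, predicted_perturbed)
```

A value near zero means the two structures are the same shape up to rotation and translation; larger values indicate that the prediction departs from the design.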
“With the matches in OmegaFold and experimental testing in the lab, we are confident these are properly folded proteins. It was amazing to see validation of these completely new protein folds that don’t exist anywhere in nature,” said Lee.
Next steps based on this work include further development of ProteinSGM for antibodies and other proteins with the most therapeutic potential, said Kim. “This will be a very interesting area for research and entrepreneurship,” he added.
Lee said he would like to see generative biology move toward the joint design of protein sequences and structures, including the conformations of protein side chains. Most of the research to date has focused on generating backbones, the main chemical structures that hold proteins together.
“Side-chain configurations ultimately determine protein function, and although designing them means an exponential increase in complexity, it may be possible with proper engineering,” said Lee. “We hope to find out.”