
AI systems can generate novel proteins that meet structural design targets
(Nanowerk News) MIT researchers are using artificial intelligence to design new proteins that go beyond those found in nature.
They developed machine-learning algorithms that can generate proteins with specific structural features, which could be used to make materials with certain mechanical properties, such as stiffness or elasticity. These biologically inspired materials could eventually replace petroleum-based or ceramic materials, with a much smaller carbon footprint.
Researchers from MIT, the MIT-IBM Watson AI Lab, and Tufts University used a generative model, the same type of machine-learning architecture used in AI systems such as DALL-E 2. But instead of using it to generate realistic images from natural language prompts, as DALL-E 2 does, they adapted the model architecture to predict amino acid sequences of proteins that meet predetermined structural goals.
In a paper published in the journal Chem (“Generative design of de novo proteins based on secondary structure constraints using an attention-based diffusion model”), the researchers demonstrate how these models can generate realistic yet novel proteins. The models, which learn the biochemical relationships that govern how proteins form, can produce new proteins that could enable a range of applications, says senior author Markus Buehler, the Jerry McAfee Professor of Engineering and professor of civil and environmental engineering and of mechanical engineering.
For example, the technology could be used to design protein-inspired food coatings that keep fruits and vegetables fresh longer while being safe for human consumption. In addition, Buehler notes, the models can generate millions of proteins in a few days, quickly giving researchers a wide range of new ideas to explore.
According to Buehler, who is a member of the MIT-IBM Watson AI Lab, “When considering the creation of proteins that nature has yet to uncover, this is a vast design space that cannot be solved by manual approaches. It is necessary to understand the language of life, how DNA encodes amino acids, and how they combine to produce protein structures. Before the advent of deep learning, this was not possible.”
Bo Ni, a postdoctoral researcher in Buehler's Laboratory for Atomistic and Molecular Mechanics, and David Kaplan, the Stern Family Professor of Engineering and professor of bioengineering at Tufts, are also authors of the paper.
Adapting new tools to the task
Proteins are formed by chains of amino acids that fold into three-dimensional patterns. The sequence of amino acids determines the mechanical properties of the protein. Although scientists have identified thousands of proteins created through evolution, they estimate that an enormous number of amino acid sequences remain undiscovered.
To speed up protein discovery, scientists recently developed deep learning models that can predict the 3D structure of a protein from its amino acid sequence. However, the inverse problem, predicting an amino acid sequence that meets given structural design targets, has proved more difficult.
Buehler and his colleagues tackled this problem with attention-based diffusion models, a recent advance in machine learning.
According to Buehler, attention-based models matter for protein design because they can learn very long-range relationships. This is important because a single mutation in a long amino acid sequence can make or break the overall design. A diffusion model learns by adding noise to its training data and then learning to recover the original data by removing that noise. Diffusion models are often more effective than other generative models at producing realistic, high-quality data that can be conditioned to meet a set of design objectives.
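The noise-and-recover idea can be illustrated with a toy sketch. The article does not give the researchers' noise schedule or network, so the cosine-style schedule, the function names, and the four-number stand-in for an encoded sequence below are all assumptions made for illustration. In a real diffusion model, a trained network estimates the noise; here the true noise is reused to show that the blend can be inverted.

```python
import math
import random

def alpha_bar(t, num_steps):
    """Fraction of the original signal kept after t noising steps.
    Cosine-style schedule, chosen here only for illustration."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2

def add_noise(x0, t, num_steps, rng):
    """Forward process: blend clean data x0 with Gaussian noise at step t."""
    a = alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, eps_hat, t, num_steps):
    """Invert the blend given a noise estimate eps_hat.
    A trained network would supply eps_hat in a real model."""
    a = alpha_bar(t, num_steps)
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a)
            for x, e in zip(xt, eps_hat)]

rng = random.Random(0)
x0 = [0.2, -1.0, 0.5, 0.9]  # hypothetical stand-in for an encoded sequence
xt, eps = add_noise(x0, t=30, num_steps=100, rng=rng)
# Using the true noise as a "perfect" estimate recovers the clean data.
x0_rec = remove_noise(xt, eps, t=30, num_steps=100)
```

Training amounts to teaching a network to produce a good `eps_hat` from the noisy input, the step index, and any conditioning information such as structural targets.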
Using this architecture, the researchers built two machine-learning models that can predict a variety of new amino acid sequences forming proteins that meet structural design targets.
Buehler explains that in the biomedical industry, a completely unknown protein may be undesirable because its properties are not understood. In other applications, however, researchers may want a brand-new protein that resembles one found in nature but does something different. The models can generate a spectrum of such proteins, which the researchers control by tuning certain parameters.
Amino acid chains fold into common patterns known as secondary structures, which give rise to different mechanical properties. For example, proteins with alpha-helix structures tend to be stretchy, while those with beta-sheet structures are usually rigid. Combining alpha helices and beta sheets can produce a material that is both stretchy and strong, such as silk.
The researchers developed two models, one that operates on the overall structural properties of the protein and one that operates at the amino acid level. Both models work by combining these amino acid structures to generate proteins. For the first model, the user inputs the desired percentages of different structures, such as 40 percent alpha helix and 60 percent beta sheet, and the model generates sequences that meet those targets. For the second model, the scientist also specifies the order of the structures along the amino acid chain, which gives much finer control over the final product.
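The two kinds of design input can be pictured with a short sketch. The article does not describe the models' actual input encoding, so the one-letter labels (`H` for alpha helix, `E` for beta sheet) and both function names are hypothetical. The residue-level input is an ordered label per amino acid; the overall-level input is just the fraction of each structure type.

```python
from collections import Counter

def residue_labels(segments):
    """Expand ordered (label, length) segments into one label per residue."""
    return "".join(label * length for label, length in segments)

def structure_fractions(labels, classes=("H", "E")):
    """Fraction of residues assigned to each secondary-structure class."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}

# Hypothetical 20-residue design: 8 helix residues followed by 12 sheet residues.
segments = [("H", 8), ("E", 12)]
per_residue = residue_labels(segments)       # residue-level target (order matters)
overall = structure_fractions(per_residue)   # overall target (order discarded)
```

Many different orderings collapse to the same 40/60 split, which is why the residue-level specification constrains the design more tightly.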
The models are connected to an algorithm that predicts protein folding, which the researchers use to determine the 3D structure of each generated protein. They then calculate the protein's resulting properties and check them against the design specifications to verify that the design meets its targets.
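A minimal sketch of such a check follows, assuming the folding step has already produced a per-residue secondary-structure string. The labels (`H` helix, `E` sheet, `C` other), the function names, and the tolerance are illustrative assumptions, not the researchers' procedure.

```python
def design_error(target, predicted_labels):
    """Largest absolute gap between target and predicted structure fractions."""
    n = len(predicted_labels)
    if n == 0:
        raise ValueError("predicted_labels must not be empty")
    return max(abs(fraction - predicted_labels.count(label) / n)
               for label, fraction in target.items())

def meets_target(target, predicted_labels, tolerance=0.1):
    """True when every structure fraction is within the tolerance of its target."""
    return design_error(target, predicted_labels) <= tolerance

target = {"H": 0.4, "E": 0.6}
# Hypothetical folding output: 7 helix, 2 other, 11 sheet residues.
predicted = "H" * 7 + "C" * 2 + "E" * 11
close_enough = meets_target(target, predicted)
all_helix_ok = meets_target(target, "H" * 20)
```

Here the first prediction misses each target by about five percentage points and passes, while an all-helix prediction fails.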
Realistic yet new design
To test their models, the researchers compared the newly generated proteins with known proteins that have similar structural properties. Many of the generated proteins overlapped with existing amino acid sequences, by about 50 to 60 percent in most cases, while others were entirely new sequences. According to Buehler, this level of similarity suggests that many of the generated proteins can be synthesized.
To check that the predicted proteins are reasonable, the researchers tried to trick the models by giving them physically impossible design targets. Instead of producing improbable proteins, the models generated the closest synthesizable solution. This result suggests that the models are robust and can find the nearest feasible design even when given impossible specifications.
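One simple way to picture "overlap" between two sequences is position-by-position identity. This is only an illustration: the article does not say how the researchers measured similarity, and real comparisons usually align the sequences first. The two ten-residue sequences below are made up.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions where two equal-length sequences match."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

# Hypothetical sequences in one-letter amino acid code.
generated = "MKTAYIAKQR"
known = "MKTGYLAKHN"
identity = percent_identity(generated, known)
```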
Ni highlights that machine learning algorithms are capable of identifying hidden relationships in nature. This capability gives researchers confidence that the resulting proteins are likely to be realistic and feasible to synthesize.
Next, the researchers plan to validate some of the new protein designs experimentally by making them in the laboratory. They also want to keep refining the models so they can design amino acid sequences that meet additional criteria, such as specific biological functions. The ultimate goal is a versatile platform that can generate protein designs for many applications, including biomedicine and materials science.
Buehler emphasizes that applications in sustainability, medicine, food, health, and materials design will require solutions beyond what nature has already provided. The new design tools allow researchers to create novel proteins with specific properties, for uses ranging from the development of new drugs to the manufacture of sustainable materials, and could help address pressing societal challenges.
This research was supported by several organizations, including the MIT-IBM Watson AI Lab, the US Department of Agriculture, the US Department of Energy, the Army Research Office, the National Institutes of Health, and the Office of Naval Research.