(Nanowerk News) Discovering new materials and drugs usually involves a manual, trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.
Researchers from MIT and the MIT-IBM Watson AI Lab have developed a unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than popular deep learning approaches.
To teach a machine learning model to predict a molecule’s biological or mechanical properties, researchers must show it millions of labeled molecular structures, a process known as training. Because discovering molecules is expensive and labeling millions of structures by hand is challenging, large training data sets are often hard to obtain, which limits the effectiveness of machine learning approaches.
In contrast, the system created by the MIT researchers can effectively predict the properties of molecules using only small amounts of data. Their system has an underlying understanding of the rules that define how building blocks combine to produce valid molecules. These rules capture similarities between molecular structures, which helps the system generate new molecules and predict their properties in a data-efficient way.
This method outperformed other machine learning approaches on both small and large data sets, and it accurately predicted molecular properties and generated viable molecules when given a data set with fewer than 100 samples.
“Our goal with this project was to use multiple data-driven methods to accelerate the discovery of new molecules, so you can train models to make predictions without all of these costly experiments,” said lead author Minghao Guo, an electrical engineering and computer science (EECS) graduate student.
Guo’s co-authors include MIT-IBM Watson AI Lab research staff members Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song ’23 and Adithya Balachandran ’23; and senior author Wojciech Matusik, a professor of electrical engineering and computer science and a member of the MIT-IBM Watson AI Lab, who leads the Computational Design and Fabrication Group within the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). This research will be presented at the International Conference on Machine Learning.
Learning the language of molecules
To achieve the best results with machine learning models, scientists need training data sets with millions of molecules that have similar properties to the ones they want to discover. In reality, these domain-specific data sets are usually very small. So researchers use models that have been pretrained on large data sets of general molecules, which they then apply to a much smaller, targeted data set. However, because these models have not acquired much domain-specific knowledge, they tend to perform poorly.
The MIT team took a different approach. They created a machine learning system that automatically learned a molecular “language” — known as a molecular grammar — using only small domain-specific data sets. It uses this grammar to build viable molecules and predict their properties.
In language theory, one generates words, sentences or paragraphs based on a set of grammatical rules. You can think of molecular grammar in the same way. It is a set of production rules that define how to produce molecules or polymers by combining atoms and substructures.
Just as the grammar of a language can generate a large number of sentences using the same rules, a single molecular grammar can represent a large number of molecules. Molecules with similar structures use the same grammatical production rules, and the system learns to recognize these similarities.
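The production-rule idea can be illustrated with a deliberately tiny example. The sketch below is not the researchers' grammar, which operates on molecular graphs; it is a hypothetical string-rewriting toy in which the rule table `RULES` and the function `expand` are invented for illustration. It shows how two small rules can derive several SMILES-style strings.

```python
# Hypothetical toy grammar (an assumption for illustration, not the paper's grammar).
# Uppercase keys in RULES are nonterminals; everything else is a terminal atom symbol.
RULES = {
    "S": [["C", "X"]],                 # start: one carbon followed by a tail
    "X": [["C", "X"], ["O"], ["N"]],   # tail: extend the chain, or end with O or N
}

def expand(symbols, depth):
    """Yield every terminal string derivable from `symbols` in at most `depth` rule applications."""
    for i, sym in enumerate(symbols):
        if sym in RULES:
            if depth == 0:
                return  # budget exhausted: discard this unfinished derivation
            for rhs in RULES[sym]:
                # Rewrite the leftmost nonterminal with one production rule.
                yield from expand(symbols[:i] + rhs + symbols[i + 1:], depth - 1)
            return
    # No nonterminals left: the derivation is complete.
    yield "".join(symbols)

print(list(expand(["S"], 3)))  # -> ['CCO', 'CCN', 'CO', 'CN']
```

Read as SMILES, the four outputs correspond to ethanol, ethylamine, methanol and methylamine. All four share the start rule, and the two-carbon molecules additionally share the chain-extension rule, which is the kind of overlap the article describes.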
Because structurally similar molecules often have similar properties, the system uses its underlying knowledge of molecular similarity to more efficiently predict the properties of new molecules.
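One way to picture how shared production rules could support prediction from little data is the sketch below. It is an assumption made for illustration, not the method in the paper: each molecule is represented by the set of rule identifiers used to derive it, Jaccard overlap stands in for the system's notion of similarity, and the property of a new molecule is estimated as a similarity-weighted average. The names `jaccard` and `predict`, the rule labels, and the numeric values are all hypothetical placeholders.

```python
def jaccard(a, b):
    """Overlap between two sets of production-rule identifiers (1.0 = identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def predict(query_rules, labeled):
    """Similarity-weighted average of known property values.

    `labeled` is a list of (rule_ids, property_value) pairs.
    """
    weights = [jaccard(query_rules, rules) for rules, _ in labeled]
    total = sum(weights)
    if total == 0:
        raise ValueError("query shares no production rule with any labeled molecule")
    return sum(w * value for w, (_, value) in zip(weights, labeled)) / total

# Placeholder data: rule names and property values are made up for the example.
labeled = [({"r1", "r2"}, 10.0), ({"r3"}, 50.0)]
print(predict({"r1", "r2"}, labeled))  # the query matches the first entry exactly
```

The design choice to compare derivations rather than raw structures reflects the article's point that molecules built from the same rules tend to be structurally, and often functionally, similar.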
“Once we have this grammar as a representation for all the different molecules, we can use it to speed up the property prediction process,” said Guo.
The system learns the production rules for molecular grammar using reinforcement learning — a process of trial and error in which a model is rewarded for behavior that brings it closer to achieving a goal.
But because there are possibly billions of ways to combine atoms and substructures, the process for learning the production rules of a grammar would be too computationally expensive for anything but the smallest data set.
To make the problem tractable, the researchers separated the molecular grammar into two parts. The first part, called the metagrammar, is a general, widely applicable grammar that they designed manually and gave the system at the outset. The system then only needs to learn a much smaller, molecule-specific grammar from the domain data set. This hierarchical approach speeds up the learning process.
Big results, small data set
In experiments, the researchers’ new system simultaneously generated viable molecules and polymers and predicted their properties more accurately than several popular machine learning approaches, even when the domain-specific data sets had only a few hundred samples. Some other methods also require a costly pretraining step that the new system avoids.
The technique was especially effective at predicting the physical properties of polymers, such as the glass transition temperature, the temperature at which a polymer changes from a hard, glassy state to a softer, rubbery one. Obtaining this information manually is often very expensive because the experiments require very high temperatures and pressures.
To push their approach further, the researchers cut one training set by more than half, down to just 94 samples. Their model still achieved results on par with methods trained on the entire data set.
“This grammar-based representation is very powerful. And because the grammar itself is a very general representation, it can be used for many kinds of graph-structured data. We are trying to identify other applications beyond chemistry or materials science,” said Guo.
In the future, they also want to extend their current molecular grammar to include the 3D geometry of molecules and polymers, which is key to understanding the interactions between polymer chains. They are also developing an interface that would show the user the production rules of the learned grammar and ask for feedback to correct rules that might be wrong, thereby increasing the accuracy of the system.