
Artificial intelligence catalyzes gene activation research and uncovers rare DNA sequences
(Nanowerk News) Artificial intelligence has exploded in our news feeds, with ChatGPT and related AI technologies becoming the focus of widespread public scrutiny. Beyond the popular chatbot, biologists are finding ways to leverage AI to investigate the core functions of our genes.
Previously, University of California San Diego researchers investigating DNA sequences that activate genes used artificial intelligence to identify an enigmatic puzzle piece related to gene activation, a fundamental process involved in growth, development, and disease. Using machine learning, a type of artificial intelligence, School of Biological Sciences Professor James T. Kadonaga and his colleagues discovered the downstream core promoter region (DPR), a “gateway” DNA activation code involved in the operation of up to a third of our genes.
Building on this discovery, Kadonaga and researchers Long Vongoc and Torrey E. Rhyne have now used machine learning to identify “synthetic extreme” DNA sequences with specifically designed functions in gene activation.
In a study published in the journal Genes & Development (“Elementary analysis of Drosophila and human DPRs reveals distinct human variants whose specificities can be enhanced by machine learning”), the researchers used machine learning to test millions of different DNA sequences, comparing the DPR gene activation element in humans with its counterpart in fruit flies (Drosophila).
Using AI, they were able to find rare, custom-tailored DPR sequences that are active in humans but not in fruit flies, and vice versa. More generally, this approach can now be used to identify synthetic DNA sequences with activities that could be of use in biotechnology and medicine.
“In the future, this strategy can be used to identify synthetic extreme DNA sequences with practical and useful applications. Instead of comparing humans (condition X) versus fruit flies (condition Y), we could test the ability of drug A (condition X) but not drug B (condition Y) to activate a gene,” said Kadonaga, a distinguished professor in the Department of Molecular Biology. “This method could also be used to find custom-tailored DNA sequences that activate a gene in tissue 1 (condition X) but not in tissue 2 (condition Y). There are many practical applications of this AI-based approach. Synthetic extreme DNA sequences are probably very rare, maybe one in a million. If they exist, they can be found using AI.”
Machine learning is a branch of AI in which computer systems continually improve and learn based on data and experience. In the new research, Kadonaga, Vongoc (a former UC San Diego postdoctoral researcher now at Velia Therapeutics) and Rhyne (a staff research associate) used a method known as support vector regression to “train” machine learning models with 200,000 established DNA sequences based on data from real-world laboratory experiments. These sequences served as the examples from which the machine learning system learned. They then “fed” 50 million test DNA sequences into the machine learning systems for humans and fruit flies and asked them to compare the sequences and identify unique sequences within the two enormous data sets.
While the machine learning system showed that the human and fruit fly sequences overlapped to a large extent, the researchers focused on the core question of whether the AI models could identify rare instances where gene activation is highly active in humans but not in fruit flies. The answer was “yes.” The machine learning models successfully identified human-specific (and fruit fly-specific) DNA sequences. Importantly, the functions of the extreme sequences predicted by the AI were verified in the Kadonaga laboratory using conventional (wet lab) testing methods.
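The screening step described above can be illustrated with a minimal sketch. It is not the researchers' code: the two scoring functions below are toy stand-ins for the trained human and fruit fly models (in the study these were support vector regression models), and the motifs, sequence length, pool size, and thresholds are all hypothetical values chosen only to make the example run quickly.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng):
    """Generate one random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))


def make_toy_scorer(preferred):
    """Return a stand-in scoring function (NOT a trained model).

    The score is the fraction of positions that match `preferred`,
    a hypothetical motif used only for illustration.
    """
    def score(seq):
        matches = sum(1 for a, b in zip(seq, preferred) if a == b)
        return matches / len(preferred)
    return score


def find_extremes(sequences, score_x, score_y, high=0.7, low=0.3):
    """Keep sequences scored active under X (>= high) and inactive under Y (<= low)."""
    hits = []
    for seq in sequences:
        sx, sy = score_x(seq), score_y(seq)
        if sx >= high and sy <= low:
            hits.append((seq, sx, sy))
    # Most condition-X-specific sequences first.
    hits.sort(key=lambda h: h[1] - h[2], reverse=True)
    return hits


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical motifs: identical in the first five positions, so the two
# scorers largely agree, and different in the last five positions.
score_human = make_toy_scorer("AGACGTGCCT")
score_fly = make_toy_scorer("AGACGAATGG")

# Small pool for speed; the study scored 50 million sequences per species.
candidates = [random_sequence(10, rng) for _ in range(50_000)]

human_specific = find_extremes(candidates, score_human, score_fly)
fly_specific = find_extremes(candidates, score_fly, score_human)

print(f"screened {len(candidates)} sequences")
print(f"human-specific hits: {len(human_specific)}")
print(f"fly-specific hits:   {len(fly_specific)}")
```

Swapping the two scorers for models of drug A and drug B, or of two tissues, would apply the same filter to the other comparisons Kadonaga mentions. Any hits from a screen like this remain predictions until they are tested in the lab.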
“Before starting this work, we did not know whether the AI model was ‘intelligent’ enough to predict the activity of 50 million sequences, especially outlier ‘extreme’ sequences with unusual activity. So it’s very impressive and quite remarkable that an AI model can predict the activity of rare one-in-a-million extreme sequences,” said Kadonaga, who added that it would be essentially impossible to conduct the comparable 100 million wet lab experiments that the machine learning technology analyzed, because each wet lab experiment would take nearly three weeks to complete.
The rare sequences identified by the machine learning system serve as a successful demonstration and set the stage for other uses of machine learning and other AI technologies in biology.
“In everyday life, people find new applications for AI tools like ChatGPT. Here, we have demonstrated the use of AI for customized DNA element design in gene activation. This method should have practical applications in biotechnology and biomedical research,” said Kadonaga. “More broadly, biologists may just be starting to harness the power of AI technology.”