
A machine learning-based protein annotation tool predicts protein function
(Nanowerk News) Microbes power the main processes of life on Earth. They drive the global cycling of carbon, nitrogen and other elements, promote plant growth, and influence the development of disease, which makes them important in every ecosystem. Research is constantly expanding databases of microbial DNA sequences, but the sequences alone do not provide all the biological information about the proteins they encode.
To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a more complete understanding of the function of proteins and other molecules. Scientists typically infer a protein's function by comparing its sequence to a reference database of already characterized proteins.
However, this comparison is difficult and does not scale to large databases. To address this challenge, scientists have applied machine learning to build models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model protein families.
Studying protein molecules in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is being integrated into the KBase framework as a new application that will allow users to annotate their genome and metagenome sequences. This will help scientists better model the effects of engineered microbes, including their effects on climate and their benefits to plant health and bioproduction. Snekmer will also help scientists study microbial evolution and microbiome patterns.
The inability of current methods to predict the function of 30–50% of bacterial protein sequences is a significant barrier to better understanding complex systems such as the soil microbiome. Most protocols rely on pairwise alignments, which become computationally complex and more challenging to interpret as the database grows.
For alignment-based protein family models, sensitivity and accuracy depend on the initial training set, which risks obsolescence as additional sequence diversity is discovered. Many bacterial proteins have no functional assignment or are assigned a general function based solely on taxonomic understanding.
To address this need, researchers at the Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer, software that leverages the redundancy of amino acid residue properties to reduce sequence space and uses short protein subsequences (kmers) as machine learning features to generate protein family models. Snekmer users can recode protein sequences into reduced-alphabet kmer vectors and then either build supervised classification models trained on input protein families or functionally classify proteins using existing Snekmer models.
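The recoding step described above can be illustrated with a minimal sketch. This is not Snekmer's own code or API: the three-group alphabet, the function names `recode` and `kmer_vector`, and the choice of k are assumptions made here purely for illustration. The sketch maps each of the 20 amino acids to one of three property classes and then counts the short subsequences (kmers) in the recoded string, yielding a fixed-length vector that a machine learning model could take as input.

```python
from collections import Counter
from itertools import product

# Hypothetical three-letter reduced alphabet (an illustrative grouping only,
# not one of Snekmer's actual recoding schemes).
GROUPS = {
    "h": "AVLIMFWC",  # hydrophobic
    "p": "STNQYGP",   # polar or small
    "c": "DEKRH",     # charged
}
RECODE = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Map each residue to its reduced-alphabet letter; unknown residues become 'x'."""
    return "".join(RECODE.get(aa, "x") for aa in sequence.upper())


def kmer_vector(sequence, k=3):
    """Count reduced-alphabet kmers and return them as a fixed-length vector.

    The vector has one entry per possible kmer over the reduced alphabet
    (3**k entries here). Kmers containing an unknown residue are ignored.
    """
    recoded = recode(sequence)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    vocabulary = ["".join(p) for p in product(sorted(GROUPS), repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


if __name__ == "__main__":
    seq = "MKVLA"
    print(recode(seq))             # reduced-alphabet version of the sequence
    print(kmer_vector(seq, k=2))   # 9-entry count vector for k=2
```

With three classes and k = 3 the feature space has only 27 dimensions, compared with 8,000 for the full 20-letter alphabet, which is the sense in which recoding reduces sequence space. Vectors like these, labeled by protein family, could then be passed to any standard classifier.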
This research has been published in Bioinformatics Advances (“Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding”).