Enzyme Function Classification
2017, Handbook of Research on Machine Learning Innovations and Trends
https://doi.org/10.4018/978-1-5225-2229-4.CH008…
4 pages
Abstract
Enzymes are important to life: they play a vital role in most biological processes in living organisms, such as metabolic pathways. Classifying enzyme function from sequence data, structure data, or extracted features remains a challenging task. Traditional experiments cost considerable time, effort, and money, whereas automated classification of enzymes reduces all three. This chapter reviews the different approaches that have been developed to classify and predict the functions of enzyme proteins, together with the new trends and challenges to be considered now and in the future. It addresses the three main approaches used to classify the function of enzymatic proteins and describes the mechanism, pros, cons, and examples of each.
Key takeaways
- Automated enzyme classification reduces time and costs compared to traditional experimental methods.
- The IUBMB enzyme classification system categorizes enzymes into six main classes based on their reactions.
- Three primary approaches exist for enzyme function classification: sequence alignment, structural analysis, and feature-based methods.
- The EC numbering scheme provides a systematic way to identify enzymes with four hierarchical levels.
- Future trends in enzyme classification focus on integrating computational methods for improved accuracy and efficiency.
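The four-level EC numbering mentioned in the takeaways can be illustrated with a small parser. This is a minimal sketch, not part of the chapter: the names `ECNumber`, `parse_ec`, and `EC_MAIN_CLASSES` are hypothetical, and the sketch assumes the common convention that an unspecified level is written as `-`.

```python
from typing import NamedTuple, Optional

# The six main IUBMB classes referred to in the takeaways.
# (IUBMB added a seventh class, translocases, in 2018; it is omitted here
# to match the six-class scheme described in the text.)
EC_MAIN_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


class ECNumber(NamedTuple):
    """Hypothetical container for the four hierarchical EC levels."""
    main_class: int
    subclass: Optional[int]
    sub_subclass: Optional[int]
    serial: Optional[int]


def parse_ec(text: str) -> ECNumber:
    """Parse 'EC 2.7.1.112' or a partial number such as 'EC 2.-.-.-'."""
    body = text.strip()
    if body.upper().startswith("EC"):
        body = body[2:].strip()
    parts = [p.strip() for p in body.split(".")]
    if len(parts) != 4:
        raise ValueError(f"expected four levels, got {len(parts)}: {text!r}")
    levels = [None if p == "-" else int(p) for p in parts]
    if levels[0] is None:
        raise ValueError("the main class must be specified")
    # Once a level is left open, every deeper level must be open as well.
    seen_open = False
    for level in levels:
        if level is None:
            seen_open = True
        elif seen_open:
            raise ValueError(f"specified level below an open one: {text!r}")
    return ECNumber(*levels)


ec = parse_ec("EC 2.7.1.112")
print(ec, EC_MAIN_CLASSES[ec.main_class])
```

Keeping the levels as separate fields makes it straightforward to evaluate a predictor at each depth of the hierarchy (main class only, main class plus subclass, and so on).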
Related papers
BMC Bioinformatics, 2009
Background: Efficient and accurate prediction of protein function from sequence is one of the standing problems in Biology. The generalised use of sequence alignments for inferring function promotes the propagation of errors, and there are limits to its applicability. Several machine learning methods have been applied to predict protein function, but they lose much of the information encoded by protein sequences because they need to transform them to obtain data of fixed length.
The problem of identifying the cellular functions and biochemical behavior of proteins is still an open problem in bioinformatics. It is becoming more important as the amount of sequence information grows exponentially over time. Alignment methods are a useful approach to functional annotation, but their use is sometimes limited, prompting the development and use of machine learning methods. Recent efforts have given promising results. However, current approaches have not used the information contained in the order of the amino acids in the peptidic sequence, relying instead on global parameters derived from peptidic composition and the structural information available. Results: A novel methodology, peptidic programs, is presented and described. This technique consists of adjusting a set of minimal computer programs to the amino acids of a peptidic sequence, in order to retrieve knowledge directly from the primary sequence without any further information. The basic concepts of peptidic programs are described, namely a proposed instruction set, virtual machine, evaluation procedures and convergence methods. This methodology is tested on 33,500 enzymes divided into 182 distinct Enzyme Commission (EC) classes by creating an individual binary classifier for each class. More than 95% of all classifiers showed accuracies above 90% on a cross-validation set. The Matthews correlation coefficient was above 60% for 68% of all classification problems. Conclusions: Overall results suggest that the tested methodology may be able to give meaningful classification results, in several cases detecting distant homologues. Peptidic programs also use very few computational resources, on average about 31 s on common hardware to assess whether a protein belongs to a given class, making them a competitive technology for extensive data searches.
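The abstract above reports both accuracy and the Matthews correlation coefficient (MCC) for its per-class binary classifiers. The following sketch shows how the two can diverge on an imbalanced one-vs-rest problem. The confusion-matrix counts are invented for illustration and are not taken from the paper, and returning 0.0 when the denominator vanishes is a common convention assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC for a binary confusion matrix, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented example: 10 members of the target class among 100 proteins,
# of which the classifier recovers only half.
counts = dict(tp=5, tn=90, fp=0, fn=5)
print(f"accuracy = {accuracy(**counts):.3f}")
print(f"MCC      = {matthews_corrcoef(**counts):.3f}")
```

With these made-up counts, accuracy is 0.95 while the MCC is about 0.69, which is one reason a study with many small classes might report both figures.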
BMC Bioinformatics, 2011
Background: The ability to accurately predict enzymatic functions is an essential prerequisite for the interpretation of cellular functions, and the reconstruction and analysis of metabolic models. Several biological databases exist that provide such information. However, in many cases these databases provide partly different and inconsistent genome annotations.
Journal of Computer Aided Chemistry, 2005
We propose a new method for the prediction of protein function, especially enzyme activity, based on statistical characteristics of oligopeptides. A known function of a protein is regarded as inherited by its oligopeptides, and the correspondence between oligopeptides and the function is calculated over all proteins. In our method, unknown functions of proteins are then predicted automatically from this correspondence. We measured the prediction performance for several enzymes by recall, precision and maximum f-measure using 28,520 whole human proteins registered in RefSeq. This paper reports prediction of a specific enzyme, 'protein-tyrosine kinase' (EC 2.7.1.112), and a large class of enzymes, 'transferases' (EC 2.-.-.-). The former and the latter achieve maximum f-measures of 0.932 and 0.786, respectively. The results suggest that the proposed method is quite efficient in predicting enzyme activity.
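Precision, recall, and maximum f-measure, as named in the abstract above, can be sketched as follows. The abstract does not say how the maximum is taken; this example assumes it is the best F1 value obtained by sweeping a decision threshold over per-protein scores. The scores and labels are invented for illustration.

```python
from typing import List, Sequence, Tuple


def precision_recall_f(
    scores: Sequence[float], labels: Sequence[bool], threshold: float
) -> Tuple[float, float, float]:
    """Precision, recall and F1 when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def max_f_measure(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Assumed definition: best F1 over all thresholds that occur in the data."""
    return max(precision_recall_f(scores, labels, t)[2] for t in set(scores))


# Invented scores for five proteins; True marks a real member of the class.
scores: List[float] = [0.9, 0.8, 0.6, 0.4, 0.2]
labels: List[bool] = [True, True, False, True, False]
print(f"maximum f-measure = {max_f_measure(scores, labels):.3f}")
```

In this toy data the best threshold is 0.4, where precision is 0.75 and recall is 1.0.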
PLoS Computational Biology, 2007
Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also subserve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.
2009
©2009 Arakaki et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The electronic version of this article is the complete one and can be found online at: http://www.biomedcentral.com/1471-2105/10/107 (doi:10.1186/1471-2105-10-107).
Background: We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) or heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-bas...
Applied and Environmental Microbiology, 2013
Abstract: Functional prediction of carbohydrate-active enzymes is difficult due to low sequence identity. However, similar enzymes often share a few short motifs, e.g., around the active site, even when the overall sequences are very different. To exploit this notion for functional prediction of carbohydrate-active enzymes, we developed a simple algorithm, peptide pattern recognition (PPR), that can divide proteins into groups of sequences that share a set of short conserved sequences. When this method was used on 118 glycoside hydrolase 5 proteins with 9% average pairwise identity and representing four characterized enzymatic functions, 97% of the proteins were sorted into groups correlating with their enzymatic activity. Furthermore, we analyzed 8,138 glycoside hydrolase 13 proteins including 204 experimentally characterized enzymes with 28 different functions. There was a 91% correlation between group and enzyme activity. These results indicate that the function of carbohydrate-act...
BMC Bioinformatics, 2009
Background: Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ the novel motif method that consists of Specific Peptides (SPs) that are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology that allows for searching SPs on arbitrary proteins, determining from its sequence whether a protein is an enzyme and what the enzyme's EC classification is.
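The idea of searching a query protein for Specific Peptides and reading off an EC branch can be sketched as below. This is a toy illustration, not the DME method itself. The peptide table is entirely invented, the function names are hypothetical, and the majority vote over hits is an assumption, since the abstract does not say how multiple hits are combined.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Invented peptide -> EC branch table, for illustration only.
# These are NOT real Specific Peptides from the paper.
TOY_SPECIFIC_PEPTIDES: Dict[str, str] = {
    "GAGKST": "2.7",
    "HDLKPEN": "2.7",
    "CWSHAG": "3.4",
}


def find_sp_hits(sequence: str, sp_table: Dict[str, str]) -> List[Tuple[int, str, str]]:
    """Return (position, peptide, EC branch) for every exact occurrence."""
    hits = []
    for peptide, branch in sp_table.items():
        start = sequence.find(peptide)
        while start != -1:
            hits.append((start, peptide, branch))
            start = sequence.find(peptide, start + 1)
    return sorted(hits)


def predict_branch(sequence: str, sp_table: Dict[str, str]) -> Optional[str]:
    """Assumed rule: majority vote over hits; None means no peptide matched."""
    hits = find_sp_hits(sequence, sp_table)
    if not hits:
        return None
    votes = Counter(branch for _, _, branch in hits)
    return votes.most_common(1)[0][0]


query = "MKTAGAGKSTLLHDLKPENAAGAGKST"  # made-up sequence
print(find_sp_hits(query, TOY_SPECIFIC_PEPTIDES))
print(predict_branch(query, TOY_SPECIFIC_PEPTIDES))
```

A sequence with no hits returns `None`, which corresponds loosely to the first question the abstract poses: whether the protein is an enzyme at all.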
2012
In protein databases there is a substantial number of proteins whose structure has been determined but which lack function annotation. Understanding the relationship between function and structure can be useful to predict function on a large scale. We have analyzed the similarities in global physicochemical parameters for a set of enzymes which were classified according to the four Enzyme Commission (EC) hierarchical levels. Using relevance theory we introduced a distance between proteins in the space of physicochemical characteristics. This was done by minimizing a cost function of the metric tensor built to reflect the EC classification system. Using an unsupervised clustering method on a set of 1025 enzymes, we obtained no relevant clustering formation compatible with EC classification. The distance distributions between enzymes from the same EC group and from different EC groups were compared by histograms. Such analysis was also performed using sequence alignment similarity as a distance. Our results suggest that global structure parameters are not sufficient to segregate enzymes according to EC hierarchy. This indicates that features essential for function are local rather than global. Consequently, methods for predicting function based on global attributes should not obtain high accuracy in main EC class prediction without relying on similarities between enzymes from training and validation datasets. Furthermore, these results are consistent with a substantial number of studies suggesting that function evolves fundamentally by recruitment, i.e., the same protein motif or fold can be used to perform different enzymatic functions and a small number of specific amino acids are actually responsible for enzyme activity. These essential amino acids should belong to active sites and an effective method for predicting function should be able to recognize them.
EURASIP Journal on Bioinformatics and Systems Biology, 2012
Advancements in sequencing technologies have witnessed an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze biochemical reactions and play an important role in metabolic pathways. Commonly, function of such enzymes is determined by experiments that can be time consuming and costly. Hence, a need for a computing method is felt that can distinguish protein enzyme sequences from those of non-enzymes and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. But, these approaches are known to fail for proteins that perform the same function and are dissimilar in their sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three layer model where the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class and bottom layer further predicts the sub-function class. The model reported overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases report better performance. Using feature selection methods, we have shown the biological relevance of a few of the top rank attributes.
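The top-down three-layer arrangement described above (enzyme or non-enzyme, then main class, then sub-class) can be sketched as a simple cascade. This is only a structural illustration: the function `classify_top_down` is hypothetical, and plain Python callables stand in for the trained random forest models and the 73 sequence-derived features, neither of which is specified in the abstract.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Features = Sequence[float]


def classify_top_down(
    features: Features,
    is_enzyme: Callable[[Features], bool],
    main_class_model: Callable[[Features], int],
    subclass_models: Dict[int, Callable[[Features], int]],
) -> Tuple[bool, Optional[int], Optional[int]]:
    """Run the three layers in order, stopping as soon as a layer says 'no'."""
    # Layer 1: enzyme versus non-enzyme.
    if not is_enzyme(features):
        return (False, None, None)
    # Layer 2: main EC class.
    main_class = main_class_model(features)
    # Layer 3: sub-class, using the model trained for that main class.
    sub_model = subclass_models.get(main_class)
    subclass = sub_model(features) if sub_model is not None else None
    return (True, main_class, subclass)


# Toy stand-ins for trained models (invented decision rules).
toy_is_enzyme = lambda f: f[0] > 0.5
toy_main_class = lambda f: 2 if f[1] > 0.0 else 3
toy_subclass_models = {2: lambda f: 7, 3: lambda f: 4}

print(classify_top_down([0.9, 1.2], toy_is_enzyme, toy_main_class, toy_subclass_models))
print(classify_top_down([0.1, 1.2], toy_is_enzyme, toy_main_class, toy_subclass_models))
```

One property of this cascade design is that a mistake made at an upper layer cannot be corrected by the layers below it, so each layer's model only ever sees inputs that the previous layer accepted.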
Alaa Tharwat