ECNet is an evolutionary context-integrated deep learning framework for protein engineering

Luo, Yunan; Jiang, Guangde; Yu, Tianhao; Liu, Yang; Vo, Lam; Ding, Hantian; Su, Yufeng; Qian, Wesley Wei; Zhao, Huimin; Peng, Jian

doi:10.1038/s41467-021-25976-8


research-article

Author(s): Yunan Luo ¹, Guangde Jiang ², Tianhao Yu ², Yang Liu ¹, Lam Vo ², Hantian Ding ¹, Yufeng Su ¹, Wesley Wei Qian ¹, Huimin Zhao ², Jian Peng ¹

Publication date (Electronic): 30 September 2021

Journal: Nature Communications

Publisher: Nature Publishing Group UK

Keywords: Protein design, Machine learning


Abstract

Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts that existing machine learning algorithms capture are not specific to the protein being engineered, their accuracy is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences, which explicitly models residue-residue epistasis for the protein of interest, with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. Using ~50 deep mutational scanning and random mutagenesis datasets, we show that ECNet predicts the sequence-function relationship more accurately than existing machine learning algorithms. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.
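The abstract's claim of generalization from low-order to higher-order mutants implies an evaluation in which a model is trained on variants with few mutations and tested on variants with more. The following is a minimal sketch of such a split, not the authors' pipeline. It assumes substitution-only variants given as full-length sequences; the function names, the cutoff of two mutations, the toy sequences, and the fitness values are all hypothetical.

```python
from typing import List, Tuple

Variant = Tuple[str, float]  # (sequence, measured fitness); toy structure


def mutation_order(wild_type: str, variant: str) -> int:
    """Count the positions at which the variant differs from the wild type."""
    if len(wild_type) != len(variant):
        raise ValueError("substitution-only variants must match wild-type length")
    return sum(1 for a, b in zip(wild_type, variant) if a != b)


def split_by_order(
    wild_type: str, variants: List[Variant], max_train_order: int = 2
) -> Tuple[List[Variant], List[Variant]]:
    """Put low-order mutants in the training set and higher-order ones in the test set."""
    train: List[Variant] = []
    test: List[Variant] = []
    for sequence, fitness in variants:
        order = mutation_order(wild_type, sequence)
        (train if order <= max_train_order else test).append((sequence, fitness))
    return train, test


# Made-up example data, for illustration only.
wt = "MKTAYIAK"
data = [
    ("MKTAYIAK", 1.0),  # wild type, order 0
    ("MKTAYIAR", 0.8),  # order 1
    ("MRTAYIAR", 0.5),  # order 2
    ("MRTGYIAR", 0.2),  # order 3
    ("ARTGYIAR", 0.1),  # order 4
]
train, test = split_by_order(wt, data)
```

With the default cutoff, the wild type and the single and double mutants land in `train`, and the triple and quadruple mutants land in `test`.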

Abstract

Protein engineering is an active area of research in which machine learning has proven quite powerful. Here, the authors present a deep learning method that integrates both general and protein-specific sequence representations to improve the engineering of one’s protein of interest.
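The idea of integrating a general and a protein-specific sequence representation can be illustrated, in its simplest form, as per-residue feature concatenation. This is only a sketch under stated assumptions: the page does not describe how ECNet combines the two representations, and the function name, feature dimensions, and numeric values below are hypothetical placeholders.

```python
from typing import List

Vector = List[float]


def combine_representations(
    local_features: List[Vector], global_features: List[Vector]
) -> List[Vector]:
    """Concatenate, residue by residue, a protein-specific (local) feature
    vector with a general (global) one. Both inputs hold one vector per residue."""
    if len(local_features) != len(global_features):
        raise ValueError("both representations must cover the same residues")
    return [loc + glob for loc, glob in zip(local_features, global_features)]


# Toy three-residue protein: 2 local and 3 global features per residue.
local = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
general = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
combined = combine_representations(local, general)
```

Each row of `combined` has five entries, the local features followed by the global ones, and could then be passed to any downstream fitness predictor.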


Author and article information

Contributors

Huimin Zhao:

ORCID: http://orcid.org/0000-0002-9069-6739

zhao5@illinois.edu

Jian Peng:

ORCID: http://orcid.org/0000-0002-1736-2978

jianpeng@illinois.edu

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2041-1723

Publication date (Electronic): 30 September 2021

Publication date PMC-release: 30 September 2021

Publication date Collection: 2021

Volume: 12

Electronic Location Identifier: 5743

Affiliations

[1] GRID grid.35403.31, ISNI 0000 0004 1936 9991, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA

[2] GRID grid.35403.31, ISNI 0000 0004 1936 9991, Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA

Author information

Yunan Luo http://orcid.org/0000-0001-7728-6412

Huimin Zhao http://orcid.org/0000-0002-9069-6739

Jian Peng http://orcid.org/0000-0002-1736-2978

Article

Publisher ID: 25976

DOI: 10.1038/s41467-021-25976-8

PMC ID: 8484459

PubMed ID: 34593817

SO-VID: e2106d27-6ac2-45fc-bc00-41a5296a2386

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received: 10 April 2021

Date accepted: 9 September 2021

Funding

Funded by: FundRef https://doi.org/10.13039/100000001, National Science Foundation (NSF);

Award ID: 2019897

Award Recipient: Huimin Zhao

Funded by: FundRef https://doi.org/10.13039/100006206, DOE | SC | Biological and Environmental Research (BER);

Award ID: DE-SC0018420

Award Recipient: Huimin Zhao

