83
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Motivation

          Protein contacts contain key information for the understanding of protein structure and function and thus, contact prediction from sequence is an important problem. Recently exciting progress has been made on this problem, but the predicted contacts for proteins without many sequence homologs is still of low quality and not very useful for de novo structure prediction.

          Method

          This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. The first residual network conducts a series of 1-dimensional convolutional transformation of sequential features; the second residual network conducts a series of 2-dimensional convolutional transformation of pairwise information including output of the first residual network, EC information and pairwise potential. By using very deep residual networks, we can accurately model contact occurrence patterns and complex sequence-structure relationship and thus, obtain higher-quality contact prediction regardless of how many sequence homologs are available for proteins in question.

          Results

          Our method greatly outperforms existing methods and leads to much more accurate contact-assisted folding. Tested on 105 CASP11 targets, 76 past CAMEO hard targets, and 398 membrane proteins, the average top L long-range prediction accuracy obtained by our method, one representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints but without any force fields can yield correct folds (i.e., TMscore>0.6) for 203 of the 579 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 of them, respectively. Our contact-assisted models also have much better quality than template-based models especially for membrane proteins. The 3D models built from our contact prediction have TMscore>0.5 for 208 of the 398 membrane proteins, while those from homology modeling have TMscore>0.5 for only 10 of them. Further, even if trained mostly by soluble proteins, our deep learning method works very well on membrane proteins. In the recent blind CAMEO benchmark, our fully-automated web server implementing this method successfully folded 6 targets with a new fold and only 0.3L-2.3L effective sequence homologs, including one β protein of 182 residues, one α+β protein of 125 residues, one α protein of 140 residues, one α protein of 217 residues, one α/β of 260 residues and one α protein of 462 residues. Our method also achieved the highest F1 score on free-modeling targets in the latest CASP (Critical Assessment of Structure Prediction), although it was not fully implemented back then.

          Author Summary

          Protein contact prediction and contact-assisted folding has made good progress due to direct evolutionary coupling analysis (DCA). However, DCA is effective on only some proteins with a very large number of sequence homologs. To further improve contact prediction, we borrow ideas from deep learning, which has recently revolutionized object recognition, speech recognition and the GO game. Our deep learning method can model complex sequence-structure relationship and high-order correlation (i.e., contact occurrence patterns) and thus, improve contact prediction accuracy greatly. Our test results show that our method greatly outperforms the state-of-the-art methods regardless how many sequence homologs are available for a protein in question. Ab initio folding guided by our predicted contacts may fold many more test proteins than the other contact predictors. Our contact-assisted 3D models also have much better quality than homology models built from the training proteins, especially for membrane proteins. One interesting finding is that even trained mostly with soluble proteins, our method performs very well on membrane proteins. Recent blind CAMEO test confirms that our method can fold large proteins with a new fold and only a small number of sequence homologs.

          Related collections

          Most cited references21

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Protein 3D Structure Computed from Evolutionary Sequence Variation

          The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues., including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Direct-coupling analysis of residue coevolution captures native contacts across many protein families.

            The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments.

              The accurate prediction of residue-residue contacts, critical for maintaining the native fold of a protein, remains an open problem in the field of structural bioinformatics. Interest in this long-standing problem has increased recently with algorithmic improvements and the rapid growth in the sizes of sequence families. Progress could have major impacts in both structure and function prediction to name but two benefits. Sequence-based contact predictions are usually made by identifying correlated mutations within multiple sequence alignments (MSAs), most commonly through the information-theoretic approach of calculating mutual information between pairs of sites in proteins. These predictions are often inaccurate because the true covariation signal in the MSA is often masked by biases from many ancillary indirect-coupling or phylogenetic effects. Here we present a novel method, PSICOV, which introduces the use of sparse inverse covariance estimation to the problem of protein contact prediction. Our method builds on work which had previously demonstrated corrections for phylogenetic and entropic correlation noise and allows accurate discrimination of direct from indirectly coupled mutation correlations in the MSA. PSICOV displays a mean precision substantially better than the best performing normalized mutual information approach and Bayesian networks. For 118 out of 150 targets, the L/5 (i.e. top-L/5 predictions for a protein of length L) precision for long-range contacts (sequence separation >23) was ≥ 0.5, which represents an improvement sufficient to be of significant benefit in protein structure prediction or model quality assessment. The PSICOV source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/PSICOV.
                Bookmark

                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Comput. Biol
                plos
                ploscomp
                PLoS Computational Biology
                Public Library of Science (San Francisco, CA USA )
                1553-734X
                1553-7358
                5 January 2017
                January 2017
                : 13
                : 1
                : e1005324
                Affiliations
                [001]Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
                Icahn School of Medicine at Mount Sinai, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                • Conceptualization: JX.

                • Data curation: SW.

                • Formal analysis: SW RZ JX.

                • Funding acquisition: JX.

                • Investigation: JX SW SS.

                • Methodology: JX.

                • Project administration: JX.

                • Resources: JX.

                • Software: JX SS SW ZL.

                • Supervision: JX.

                • Validation: JX SW SS RZ.

                • Visualization: SW.

                • Writing – original draft: JX SW.

                • Writing – review & editing: JX.

                Author information
                http://orcid.org/0000-0001-7111-4839
                Article
                PCOMPBIOL-D-16-01502
                10.1371/journal.pcbi.1005324
                5249242
                28056090
                c893ec61-e786-4733-8f22-d868ba2135e6
                © 2017 Wang et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 14 September 2016
                : 20 December 2016
                Page count
                Figures: 22, Tables: 13, Pages: 34
                Funding
                Funded by: funder-id http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences;
                Award ID: R01GM089753
                Award Recipient :
                Funded by: funder-id http://dx.doi.org/10.13039/100000076, Directorate for Biological Sciences;
                Award ID: DBI-1564955
                Award Recipient :
                This work is supported by National Institutes of Health grant R01GM089753 to JX and National Science Foundation grant DBI-1564955 to JX. The authors are also grateful to the support of Nvidia Inc. and the computational resources provided by XSEDE through the grant MCB150134 to JX. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences
                Cell Biology
                Cellular Structures and Organelles
                Cell Membranes
                Membrane Proteins
                Biology and Life Sciences
                Molecular Biology
                Macromolecular Structure Analysis
                Protein Structure
                Protein Structure Prediction
                Biology and Life Sciences
                Biochemistry
                Proteins
                Protein Structure
                Protein Structure Prediction
                Computer and Information Sciences
                Neural Networks
                Biology and Life Sciences
                Neuroscience
                Neural Networks
                Biology and Life Sciences
                Molecular Biology
                Macromolecular Structure Analysis
                Protein Structure
                Biology and Life Sciences
                Biochemistry
                Proteins
                Protein Structure
                Research and Analysis Methods
                Mathematical and Statistical Techniques
                Statistical Methods
                Forecasting
                Physical Sciences
                Mathematics
                Statistics (Mathematics)
                Statistical Methods
                Forecasting
                Research and Analysis Methods
                Computational Techniques
                Split-Decomposition Method
                Multiple Alignment Calculation
                Research and Analysis Methods
                Mathematical and Statistical Techniques
                Mathematical Functions
                Convolution
                Research and Analysis Methods
                Database and Informatics Methods
                Biological Databases
                Protein Structure Databases
                Biology and Life Sciences
                Molecular Biology
                Macromolecular Structure Analysis
                Protein Structure
                Protein Structure Databases
                Biology and Life Sciences
                Biochemistry
                Proteins
                Protein Structure
                Protein Structure Databases
                Custom metadata
                vor-update-to-uncorrected-proof
                2017-01-20
                1) The PDB25 list is available at http://dunbrack.fccc.edu/PISCES.php. 2) The CASP11 test proteins are available at the CASP web site ( http://predictioncenter.org/). 3) The other data lists are provided in the paper and the Supporting Information files.

                Quantitative & Systems biology
                Quantitative & Systems biology

                Comments

                Comment on this article

                scite_
                0
                0
                0
                0
                Smart Citations
                0
                0
                0
                0
                Citing PublicationsSupportingMentioningContrasting
                View Citations

                See how this article has been cited at scite.ai

                scite shows how a scientific paper has been cited by providing the context of the citation, a classification describing whether it supports, mentions, or contrasts the cited claim, and a label indicating in which section the citation was made.

                Similar content50

                Cited by348

                Most referenced authors496