17
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Learning protein constitutive motifs from sequence data

      research-article

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (α-helixes and β-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and 'turning up' or 'turning down' the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families.

          eLife digest

          Almost every process that keeps a cell alive depends on the activities of several proteins. All proteins are made from chains of smaller molecules called amino acids, and the specific sequence of amino acids determines the protein’s overall shape, which in turn controls what the protein can do. Yet, the relationships between a protein’s structure and its function are complex, and it remains unclear how the sequence of amino acids in a protein actually determine its features and properties.

          Machine learning is a computational approach that is often applied to understand complex issues in biology. It uses computer algorithms to spot statistical patterns in large amounts of data and, after 'learning' from the data, the algorithms can then provide new insights, make predictions or even generate new data.

          Tubiana et al. have now used a relatively simple form of machine learning to study the amino acid sequences of 20 different families of proteins. First, frameworks of algorithms –known as Restricted Boltzmann Machines, RBM for short – were trained to read some amino-acid sequence data that coded for similar proteins. After ‘learning’ from the data, the RBM could then infer statistical patterns that were common to the sequences. Tubiana et al. saw that many of these inferred patterns could be interpreted in a meaningful way and related to properties of the proteins. For example, some were related to known twists and loops that are commonly found in proteins; others could be linked to specific activities. This level of interpretability is somewhat at odds with the results of other common methods used in machine learning, which tend to behave more like a ‘black box’.

          Using their RBM, Tubiana et al. then proposed how to design new proteins that may prove to have interesting features. In the future, similar methods could be applied across computational biology as a way to make sense of complex data in an understandable way.

          Related collections

          Most cited references66

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Protein 3D Structure Computed from Evolutionary Sequence Variation

          The evolutionary trajectory of a protein through sequence space is constrained by its function. Collections of sequence homologs record the outcomes of millions of evolutionary experiments in which the protein evolves according to these constraints. Deciphering the evolutionary record held in these sequences and exploiting it for predictive and engineering purposes presents a formidable challenge. The potential benefit of solving this challenge is amplified by the advent of inexpensive high-throughput genomic sequencing. In this paper we ask whether we can infer evolutionary constraints from a set of sequence homologs of a protein. The challenge is to distinguish true co-evolution couplings from the noisy set of observed correlations. We address this challenge using a maximum entropy model of the protein sequence, constrained by the statistics of the multiple sequence alignment, to infer residue pair couplings. Surprisingly, we find that the strength of these inferred couplings is an excellent predictor of residue-residue proximity in folded structures. Indeed, the top-scoring residue couplings are sufficiently accurate and well-distributed to define the 3D protein fold with remarkable accuracy. We quantify this observation by computing, from sequence alone, all-atom 3D structures of fifteen test proteins from different fold classes, ranging in size from 50 to 260 residues., including a G-protein coupled receptor. These blinded inferences are de novo, i.e., they do not use homology modeling or sequence-similar fragments from known structures. The co-evolution signals provide sufficient information to determine accurate 3D protein structure to 2.7–4.8 Å Cα-RMSD error relative to the observed structure, over at least two-thirds of the protein (method called EVfold, details at http://EVfold.org). This discovery provides insight into essential interactions constraining protein evolution and will facilitate a comprehensive survey of the universe of protein structures, new strategies in protein and drug design, and the identification of functional genetic variants in normal and disease genomes.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: not found

            Direct-coupling analysis of residue coevolution captures native contacts across many protein families.

            The similarity in the three-dimensional structures of homologous proteins imposes strong constraints on their sequence variability. It has long been suggested that the resulting correlations among amino acid compositions at different sequence positions can be exploited to infer spatial contacts within the tertiary protein structure. Crucial to this inference is the ability to disentangle direct and indirect correlations, as accomplished by the recently introduced direct-coupling analysis (DCA). Here we develop a computationally efficient implementation of DCA, which allows us to evaluate the accuracy of contact prediction by DCA for a large number of protein domains, based purely on sequence information. DCA is shown to yield a large number of correctly predicted contacts, recapitulating the global structure of the contact map for the majority of the protein domains examined. Furthermore, our analysis captures clear signals beyond intradomain residue contacts, arising, e.g., from alternative protein conformations, ligand-mediated residue couplings, and interdomain interactions in protein oligomers. Our findings suggest that contacts predicted by DCA can be used as a reliable guide to facilitate computational predictions of alternative protein conformations, protein complex formation, and even the de novo prediction of protein domain structures, contingent on the existence of a large number of homologous sequences which are being rapidly made available due to advances in genome sequencing.
              Bookmark
              • Record: found
              • Abstract: not found
              • Article: not found
              Is Open Access

              The Hsp70 and Hsp60 chaperone machines.

                Bookmark

                Author and article information

                Contributors
                Role: Reviewing Editor
                Role: Senior Editor
                Journal
                eLife
                Elife
                eLife
                eLife
                eLife Sciences Publications, Ltd
                2050-084X
                12 March 2019
                2019
                : 8
                : e39397
                Affiliations
                [1 ]deptLaboratory of Physics of the Ecole Normale Supérieure CNRS UMR 8023 & PSL Research ParisFrance
                Cambridge University United Kingdom
                Max Planck Institute for Developmental Biology Germany
                Cambridge University United Kingdom
                Author information
                http://orcid.org/0000-0001-8878-5620
                http://orcid.org/0000-0002-1852-7789
                http://orcid.org/0000-0002-4459-0204
                Article
                39397
                10.7554/eLife.39397
                6436896
                30857591
                9df61079-edc9-4564-857f-b28d9330a7d6
                © 2019, Tubiana et al

                This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

                History
                : 03 October 2018
                : 24 February 2019
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100007649, École Normale Supérieure;
                Award ID: Allocation Specifique
                Award Recipient :
                Funded by: FundRef http://dx.doi.org/10.13039/501100001665, Agence Nationale de la Recherche;
                Award ID: CE30-0021-01 RBMPro
                Award Recipient :
                The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
                Categories
                Tools and Resources
                Computational and Systems Biology
                Physics of Living Systems
                Custom metadata
                A new machine-learning toolbox unveils coevolutionary protein motifs related to structure, function, and phylogeny from sequence information only.

                Life sciences
                sequence analysis,machine learning,coevolution,none
                Life sciences
                sequence analysis, machine learning, coevolution, none

                Comments

                Comment on this article