      Multi-scale encoding of amino acid sequences for predicting protein interactions using gradient boosting decision tree

      research-article


          Abstract

          A number of computational approaches have been developed to predict protein interactions effectively and accurately. However, most of these methods perform worse when other biological data sources (e.g., protein structure information, protein domains, or gene neighborhood information) are not available. In the present work, we propose a method for predicting protein interactions that makes full use of the physicochemical characteristics of amino acids. A protein sequence is encoded at multiple scales by seven amino acid properties, including their qualitative and quantitative descriptions. Five kinds of protein descriptors (frequency, composition, transformation, distribution, and auto covariance) are extracted from these encodings to represent each protein sequence. The resulting 347-dimensional feature representation captures not only the compositional and positional information of the amino acids in the sequence but also their statistical significance. Based on this feature representation, the gradient boosting decision tree algorithm is used to predict the protein interaction class. When tested on the PPI data of S. cerevisiae, the proposed method achieves a prediction accuracy of 95.28% with a Matthews correlation coefficient of 90.68%. On the H. pylori and Human data sets, it raises the accuracies to 89.27% and 98.00%, respectively, compared with state-of-the-art works. Extensive experiments are also performed on a crossover protein-protein interaction network, and the prediction accuracies are very promising. Because of the learning capabilities of the gradient boosting decision tree and the multi-scale feature representation scheme, the proposed method might be a useful tool for future proteomics studies.
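
To make the descriptor idea concrete, the following is a minimal sketch of three of the descriptor families named in the abstract (composition, transformation, and auto covariance), computed for a single physicochemical property. It is not the authors' implementation. The three-class hydrophobicity grouping, the use of the Kyte-Doolittle hydropathy index, the reading of "transformation" as the frequency of class changes between adjacent residues, and all function names are assumptions chosen for illustration; the paper uses seven properties and produces 347 dimensions, which this sketch does not attempt to reproduce.

```python
from itertools import combinations

# Assumed three-class grouping by hydrophobicity (illustrative only):
# class 0 = polar, class 1 = neutral, class 2 = hydrophobic.
GROUPS = ("RKEDQN", "GASTPHY", "CLVIMFW")
CLASS_OF = {aa: k for k, members in enumerate(GROUPS) for aa in members}

# Kyte-Doolittle hydropathy index, used here as the quantitative encoding.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of residues that fall into each class."""
    classes = [CLASS_OF[aa] for aa in seq]
    return [classes.count(k) / len(classes) for k in range(len(GROUPS))]


def transformation(seq):
    """Fraction of adjacent residue pairs that switch between two classes.

    One value per unordered class pair: (0,1), (0,2), (1,2).
    Requires at least two residues.
    """
    classes = [CLASS_OF[aa] for aa in seq]
    pairs = list(zip(classes, classes[1:]))
    return [
        sum(1 for p in pairs if set(p) == {a, b}) / len(pairs)
        for a, b in combinations(range(len(GROUPS)), 2)
    ]


def auto_covariance(seq, lag):
    """Covariance of the property between residues `lag` positions apart."""
    values = [HYDROPATHY[aa] for aa in seq]
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must satisfy 0 < lag < len(seq)")
    mean = sum(values) / n
    return sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    ) / (n - lag)


def encode(seq, max_lag=3):
    """Concatenate the three descriptor families into one feature vector."""
    lags = range(1, max_lag + 1)
    return (
        composition(seq)
        + transformation(seq)
        + [auto_covariance(seq, lag) for lag in lags]
    )


if __name__ == "__main__":
    print(encode("MKTAYIAKQR"))
```

Here `encode` returns 3 + 3 + `max_lag` values per sequence. Repeating the same steps for several properties and adding the remaining descriptor families is how a longer vector would be assembled. Vectors for a protein pair could then be passed to any gradient boosting classifier, for example scikit-learn's `GradientBoostingClassifier`; the abstract does not state which implementation the authors used. Residues outside the 20 standard amino acids raise a `KeyError` in this sketch.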


                Author and article information

                Contributors
                Role: Editor
                Journal
                PLoS ONE
                Public Library of Science (San Francisco, CA USA )
                1932-6203
                8 August 2017
                12(8): e0181426
                Affiliations
                [1] School of Computer Science and Technology, Tianjin University, Tianjin, China, 300072
                [2] Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China, 300072
                Harbin Institute of Technology Shenzhen Graduate School, CHINA
                Author notes

                Competing Interests: The authors have declared that no competing interests exist.

                • Data curation: CZ YJD.

                • Formal analysis: CZ XJG.

                • Funding acquisition: XJG.

                • Investigation: XJG.

                • Methodology: XJG HY.

                • Project administration: XJG.

                • Software: CZ.

                • Supervision: XJG.

                • Validation: HY.

                • Visualization: CZ.

                • Writing – original draft: CZ.

                • Writing – review & editing: XJG HY FG.

                Article
                PONE-D-16-47378
                10.1371/journal.pone.0181426
                5549711
                28792503
                70ff8f1b-f932-40ec-a50d-e187f150ea8c
                © 2017 Zhou et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 3 January 2017
                Accepted: 30 June 2017
                Page count
                Figures: 6, Tables: 12, Pages: 18
                Funding
                Funded by: National Natural Science Foundation of China (funder ID: http://dx.doi.org/10.13039/501100001809); Award ID: 61930007
                Funded by: National High Technology Research and Development Program of China; Award ID: 2015BA3005
                This work was supported by: National Natural Science Foundation of China (61930007), URL: http://www.nsfc.gov.cn/ (XG); National High Technology Research and Development Program of China (863 Program) (2015BA3005), URL: http://www.most.gov.cn/eng/programmes1/ (XG); and National 973 Program (2013CB32930X), URL: http://www.most.gov.cn/eng/programmes1/200610/t20061009_36223.htm (XG). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Molecular Biology > Molecular Biology Techniques > Sequencing Techniques > Protein Sequencing
                Research and Analysis Methods > Molecular Biology Techniques > Sequencing Techniques > Protein Sequencing
                Engineering and Technology > Management Engineering > Decision Analysis > Decision Trees
                Research and Analysis Methods > Decision Analysis > Decision Trees
                Biology and Life Sciences > Biochemistry > Proteins > Protein Interactions
                Research and Analysis Methods > Extraction Techniques > Protein Extraction
                Engineering and Technology > Management Engineering > Decision Analysis > Decision Trees > Decision Tree Learning
                Research and Analysis Methods > Decision Analysis > Decision Trees > Decision Tree Learning
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Decision Tree Learning
                Physical Sciences > Mathematics > Applied Mathematics > Algorithms > Machine Learning Algorithms
                Research and Analysis Methods > Simulation and Modeling > Algorithms > Machine Learning Algorithms
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Machine Learning Algorithms
                Physical Sciences > Mathematics > Probability Theory > Random Variables > Covariance
                Computer and Information Sciences > Network Analysis > Protein Interaction Networks
                Biology and Life Sciences > Biochemistry > Proteomics > Protein Interaction Networks
                Custom metadata
                The minimal underlying data set necessary for replication of this study, including the source code and data sets used in the manuscript, has been uploaded to GitHub (https://github.com/lovekeyczw/zhouchang).
