      Prediction of human-virus protein-protein interactions through a sequence embedding-based machine learning method

      Computational and Structural Biotechnology Journal
      Research Network of Computational and Structural Biotechnology
      Abbreviations: PPIs, protein-protein interactions; RF, Random Forest; Y2H, yeast two-hybrid; MS, mass spectroscopy; ML, machine learning; CT, Conjoint Triad; AC, Auto Covariance; LD, Local Descriptor; SGD, stochastic gradient descent; SVM, Support Vector Machine; Adaboost, Adaptive Boosting; MLP, Multiple Layer Perceptron; RBF, radial basis function; ACC, Accuracy; MCC, Matthews correlation coefficient; ROC, Receiver Operating Characteristic; PR, Precision-Recall; AUC, area under the ROC curve; AUPRC, area under the PR curve
      Keywords: Human-virus interaction; Protein-protein interaction; Prediction; Embedding; Doc2vec; Machine learning

          Highlights

          • We proposed a doc2vec + RF classifier for predicting human-virus PPIs.

          • Doc2vec can effectively capture more context information from protein sequences.

          • The proposed method performed better than several existing predictors.

          Abstract

          The identification of human-virus protein-protein interactions (PPIs) is an essential and challenging research topic that can potentially provide a mechanistic understanding of viral infection. Because the experimental determination of human-virus PPIs is time-consuming and labor-intensive, computational methods play an important role in providing testable hypotheses and complementing the large-scale determination of interactomes between species. In this work, we applied an unsupervised sequence embedding technique (doc2vec) to represent protein sequences as rich, low-dimensional feature vectors. By training a Random Forest (RF) classifier on a dataset that covers known PPIs between human and all viruses, we obtained excellent predictive accuracy, outperforming various combinations of machine learning algorithms and commonly used sequence encoding schemes. In a rigorous comparison with three existing human-virus PPI prediction methods, our computational framework also delivered very competitive and promising performance, suggesting that the doc2vec encoding scheme effectively captures context information in protein sequences that is relevant to the corresponding protein-protein interactions. Our approach is freely accessible through our web server as part of our host-pathogen PPI prediction platform (http://zzdlab.com/InterSPPI/). Taken together, we hope the current work not only contributes a useful predictor to accelerate the exploration of human-virus PPIs, but also provides some meaningful insights into human-virus relationships.
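
The preprocessing that a doc2vec-style encoding needs can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: that each protein sequence is treated as a "document" whose "words" are overlapping k-mers, and that a human-virus pair is represented by concatenating the two protein vectors. The function names `kmer_tokens` and `pair_features` and the choice `k = 3` are hypothetical and serve illustration only; they are not the authors' implementation.

```python
from typing import List, Sequence


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a protein sequence into overlapping k-mers.

    Assumption: k-mers play the role of "words" when a sequence is
    handed to a doc2vec implementation; k = 3 is an illustrative choice.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    sequence = sequence.strip().upper()
    # A sequence shorter than k yields no tokens.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def pair_features(human_vec: Sequence[float],
                  virus_vec: Sequence[float]) -> List[float]:
    """Concatenate two protein embeddings into one feature vector.

    Assumption: the classifier sees a protein pair as the human
    embedding followed by the virus embedding.
    """
    return list(human_vec) + list(virus_vec)


if __name__ == "__main__":
    # Toy sequence, not taken from any dataset.
    print(kmer_tokens("MKTAYIAK", k=3))
    # Toy embeddings standing in for doc2vec output.
    print(pair_features([0.1, 0.2], [0.3, 0.4]))
```

In a full pipeline, the token lists would be passed to a doc2vec implementation to learn one embedding per protein, and the concatenated pair vectors, labeled as interacting or non-interacting, would be used to train a Random Forest classifier.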

                Author and article information

                Contributors
                Journal
                Comput Struct Biotechnol J
                ISSN: 2001-0370
                Published: 26 December 2019
                Year: 2020
                Volume: 18
                Pages: 153-161
                Affiliations
                [a] State Key Laboratory of Agrobiotechnology, College of Biological Sciences, China Agricultural University, Beijing 100193, China
                [b] State Key Laboratory of Plant Physiology and Biochemistry, College of Biological Sciences, China Agricultural University, Beijing 100193, China
                [c] National Demonstration Center for Experimental Biological Sciences Education, College of Biological Sciences, China Agricultural University, Beijing 100193, China
                [d] Dept. of Computer Science, University of Miami, Miami, FL 33146, USA
                [e] Dept. of Biology, University of Miami, Miami, FL 33146, USA
                [f] Center of Computational Science, University of Miami, Miami, FL 33146, USA
                [g] Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33136, USA
                Author notes
                [*] Corresponding author. zidingzhang@cau.edu.cn
                [1] These two authors contributed equally to this work.

                Article
                PII: S2001-0370(19)30429-5
                DOI: 10.1016/j.csbj.2019.12.005
                PMCID: 6961065
                PMID: 31969974
                a2b876e8-2d0c-4eb0-8c88-9b0dc0a7ead7
                © 2019 The Authors

                This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

                History
                Received: 10 October 2019
                Revised: 29 November 2019
                Accepted: 10 December 2019
                Categories
                Research Article

