      ToxDL: deep learning using primary structure and domain embeddings for assessing protein toxicity

          Abstract

          Motivation

          Genetically engineering food crops involves introducing proteins from other species into crop plant species or modifying already existing proteins with gene editing techniques. In addition, newly synthesized proteins can be used as therapeutic protein drugs against diseases. For both research and safety regulation purposes, being able to assess the potential toxicity of newly introduced/synthesized proteins is of high importance.

          Results

          In this study, we present ToxDL, a deep learning-based approach for in silico prediction of protein toxicity from sequence alone. ToxDL consists of (i) a module encompassing a convolutional neural network that has been designed to handle variable-length input sequences, (ii) a domain2vec module for generating protein domain embeddings and (iii) an output module that classifies proteins as toxic or non-toxic, using the outputs of the two aforementioned modules. Independent test results obtained for animal proteins and cross-species transferability results obtained for bacterial proteins indicate that ToxDL outperforms traditional homology-based approaches and state-of-the-art machine-learning techniques. Furthermore, through visualizations based on saliency maps, we are able to verify that the proposed network learns known toxic motifs. Moreover, the saliency maps allow for directed in silico modification of a sequence, thus making it possible to alter its predicted protein toxicity.
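
          The three-part design described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the weights are random rather than trained, the layer sizes, the domain keys (`DOM_A`, `DOM_B`) and all function names are hypothetical, and the saliency shown is a simple occlusion-based variant (masking one residue at a time) rather than whatever saliency method ToxDL itself uses. The sketch only shows how global max pooling turns a variable-length sequence into a fixed-size vector that can be concatenated with an averaged domain embedding and passed to a single sigmoid output.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Encode a sequence as 20-dimensional one-hot rows (unknown residues -> zeros)."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]

def conv_global_max(x, filters):
    """1D convolution + ReLU + global max pooling.

    Each filter is a list of `width` rows of 20 weights. Pooling over all
    positions yields one value per filter, whatever the sequence length.
    """
    pooled = []
    for f in filters:
        width = len(f)
        best = 0.0  # ReLU floor; also the result if the sequence is shorter than the filter
        for start in range(len(x) - width + 1):
            act = sum(w * v for k in range(width) for w, v in zip(f[k], x[start + k]))
            best = max(best, act)
        pooled.append(best)
    return pooled

def mean_embedding(domains, table, dim):
    """Average the embedding vectors of a protein's domains (zeros if none are known)."""
    vecs = [table[d] for d in domains if d in table]
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def predict(seq, domains, params):
    """Return a score in (0, 1) from concatenated sequence and domain features."""
    features = conv_global_max(one_hot(seq), params["filters"])
    features += mean_embedding(domains, params["domain_table"], params["dim"])
    z = params["bias"] + sum(w * f for w, f in zip(params["out_weights"], features))
    return 1.0 / (1.0 + math.exp(-z))

def occlusion_saliency(seq, domains, params):
    """Per-residue score drop when that residue is masked with 'X' (assumed stand-in for gradient saliency)."""
    base = predict(seq, domains, params)
    return [base - predict(seq[:i] + "X" + seq[i + 1:], domains, params)
            for i in range(len(seq))]

def random_params(n_filters=4, width=3, dim=5, seed=0):
    """Untrained, randomly initialised parameters; all sizes are illustrative assumptions."""
    rng = random.Random(seed)
    u = lambda: rng.uniform(-1.0, 1.0)
    return {
        "filters": [[[u() for _ in AMINO_ACIDS] for _ in range(width)]
                    for _ in range(n_filters)],
        "domain_table": {"DOM_A": [u() for _ in range(dim)],
                         "DOM_B": [u() for _ in range(dim)]},
        "dim": dim,
        "out_weights": [u() for _ in range(n_filters + dim)],
        "bias": 0.0,
    }

if __name__ == "__main__":
    params = random_params()
    # Sequences of different lengths go through the same untrained model.
    for seq, doms in [("ACDEFG", ["DOM_A"]), ("MKTAYIAKQRQISFVKSH", ["DOM_A", "DOM_B"])]:
        print(seq, round(predict(seq, doms, params), 3))
    print(occlusion_saliency("ACDEFG", ["DOM_A"], params))
```

          Because the parameters are random, the printed scores carry no biological meaning; in a trained model, residues with large occlusion values would be candidates for the kind of directed in silico modification mentioned in the abstract.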

          Availability and implementation

          ToxDL is freely available at http://www.csbio.sjtu.edu.cn/bioinf/ToxDL/. The source code can be found at https://github.com/xypan1232/ToxDL.

          Supplementary information

          Supplementary data are available at Bioinformatics online.

                Author and article information

                Journal
                Bioinformatics
                Oxford University Press (OUP)
                ISSN (print): 1367-4803
                ISSN (online): 1460-2059
                November 01 2020
                January 29 2021
                July 21 2020
                Volume: 36
                Issue: 21
                Pages: 5159-5168
                Affiliations
                [1] Department of Automation, Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
                [2] Department for Electronics and Information Systems, IDLab, Ghent University, Ghent 9000, Belgium
                [3] BASF Belgium Coordination Center – Innovation Center Gent, Ghent 9000, Belgium
                [4] Department of Environmental Technology, Food Technology and Molecular Biotechnology, Center for Biotech Data Science, Ghent University Global Campus, Songdo, Incheon 305-701, South Korea
                Article
                DOI: 10.1093/bioinformatics/btaa656
                PMID: 32692832
                3400e6b9-f843-4e87-ae10-3074569276c7
                © 2020

                https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model