      Mapping the glycosyltransferase fold landscape using interpretable deep learning

          Abstract

          Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.
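
The general idea of a CNN-attention classifier over secondary structure strings can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the three-state alphabet (`H`, `E`, `C`), the filter count and width, the single attention query vector, the random untrained weights, and the fold labels used as output classes are all assumptions made for the example. It shows how the attention weights give a per-position importance score, which is the kind of signal that makes such a model interpretable.

```python
import math
import random

SS_STATES = "HEC"  # assumed 3-state alphabet: helix, strand, coil


def one_hot(ss):
    """Encode a secondary structure string as one-hot vectors, one per residue."""
    return [[1.0 if s == state else 0.0 for state in SS_STATES] for s in ss]


def conv1d(x, kernels):
    """Valid 1D convolution plus ReLU. x is L x C; each kernel is W x C."""
    width, channels = len(kernels[0]), len(x[0])
    out = []
    for i in range(len(x) - width + 1):
        row = []
        for k in kernels:
            s = sum(k[j][c] * x[i + j][c]
                    for j in range(width) for c in range(channels))
            row.append(max(0.0, s))
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Score each position against a query vector and return the
    attention-weighted average of the features plus the weights."""
    scores = [sum(f * q for f, q in zip(row, query)) for row in features]
    weights = softmax(scores)
    pooled = [sum(w * row[d] for w, row in zip(weights, features))
              for d in range(len(query))]
    return pooled, weights


def predict_fold(ss, params):
    """Return (class probabilities, per-window attention weights)."""
    features = conv1d(one_hot(ss), params["kernels"])
    pooled, weights = attention_pool(features, params["query"])
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(params["out_w"], params["out_b"])]
    return softmax(logits), weights


def random_params(n_filters=4, width=5, folds=("GT-A", "GT-B", "GT-C"), seed=0):
    """Hypothetical, untrained parameters; a real model would learn these."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]

    return {
        "folds": folds,
        "kernels": [[vec(len(SS_STATES)) for _ in range(width)]
                    for _ in range(n_filters)],
        "query": vec(n_filters),
        "out_w": [vec(n_filters) for _ in folds],
        "out_b": [0.0 for _ in folds],
    }


params = random_params()
ss = "CCHHHHHHCCEEEEECCHHHHHHHCC"  # made-up secondary structure string
probs, weights = predict_fold(ss, params)
top_window = max(range(len(weights)), key=weights.__getitem__)
print(dict(zip(params["folds"], probs)), "most attended window starts at", top_window)
```

Because the weights are random, the printed probabilities carry no biological meaning; the sketch only shows the data flow from a secondary structure string to class probabilities and attention weights.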

          Abstract

          Glycosyltransferases (GTs) are proteins that display extensive sequence and functional variation on a subset of 3D folds. Here, the authors use interpretable deep learning to predict 3D folds from sequence without the need for sequence alignment, which also enables the prediction of GTs with new folds.

                Author and article information

                Contributors
                nkannan@uga.edu
                Journal
                Nat Commun
                Nature Communications
                Nature Publishing Group UK (London)
                2041-1723
                27 September 2021
                2021; 12: 5656
                Affiliations
                [1] Institute of Bioinformatics, University of Georgia, Athens, GA, USA (GRID grid.213876.9, ISNI 0000 0004 1936 738X)
                [2] Complex Carbohydrate Research Center, University of Georgia, Athens, GA, USA (GRID grid.213876.9, ISNI 0000 0004 1936 738X)
                [3] Department of Computer Science, University of Georgia, Athens, GA, USA (GRID grid.213876.9, ISNI 0000 0004 1936 738X)
                [4] Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA (GRID grid.213876.9, ISNI 0000 0004 1936 738X)
                Author information
                http://orcid.org/0000-0003-4471-6759
                http://orcid.org/0000-0003-1768-582X
                http://orcid.org/0000-0002-2833-8375
                Article
                25975
                10.1038/s41467-021-25975-9
                8476585
                34580305
                0a294f12-a74f-45c5-8364-35e269e652ba
                © The Author(s) 2021

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                : 5 April 2021
                : 31 August 2021
                Funding
                Funded by: FundRef https://doi.org/10.13039/100000009, Foundation for the National Institutes of Health (Foundation for the National Institutes of Health, Inc.);
                Award ID: R01GM130915

                Keywords: computational models, machine learning, protein sequence analyses, protein structure predictions, software
