Mapping the glycosyltransferase fold landscape using interpretable deep learning

Abstract

Glycosyltransferases (GTs) play fundamental roles in nearly all cellular processes through the biosynthesis of complex carbohydrates and glycosylation of diverse protein and small molecule substrates. The extensive structural and functional diversification of GTs presents a major challenge in mapping the relationships connecting sequence, structure, fold and function using traditional bioinformatics approaches. Here, we present a convolutional neural network with attention (CNN-attention) based deep learning model that leverages simple secondary structure representations generated from primary sequences to provide GT fold prediction with high accuracy. The model learns distinguishing secondary structure features free of primary sequence alignment constraints and is highly interpretable. It delineates sequence and structural features characteristic of individual fold types, while classifying them into distinct clusters that group evolutionarily divergent families based on shared secondary structural features. We further extend our model to classify GT families of unknown folds and variants of known folds. By identifying families that are likely to adopt novel folds such as GT91, GT96 and GT97, our studies expand the GT fold landscape and prioritize targets for future structural studies.

Abstract

Glycosyltransferases (GTs) are proteins that display extensive sequence and functional variation on a subset of 3D folds. Here, the authors use interpretable deep learning to predict 3D folds from sequence without the need for sequence alignment, which also enables the prediction of GTs with new folds.
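To make the idea in the abstract concrete, the following is a minimal sketch of a convolution-plus-attention classifier over a secondary structure string. It is not the authors' implementation: the three-state alphabet (H, E, C), the filter count, kernel size, label list, and every function name are assumptions chosen for illustration, and the weights are random rather than trained, so the predicted probabilities carry no biological meaning. The point is only to show how per-position attention weights fall out of such a model and why they can be inspected.

```python
import math
import random

SS_STATES = "HEC"  # assumed 3-state alphabet: helix, strand, coil
FOLDS = ["GT-A", "GT-B", "GT-C"]  # illustrative labels, not the paper's full class list


def one_hot(ss):
    """Encode a secondary-structure string as L one-hot vectors of length 3."""
    return [[1.0 if s == state else 0.0 for state in SS_STATES] for s in ss]


def conv1d_relu(x, filters):
    """Valid 1D convolution followed by ReLU.

    x is L x C; filters holds F kernels, each K x C. Returns (L - K + 1) x F.
    """
    k = len(filters[0])
    out = []
    for start in range(len(x) - k + 1):
        window = x[start:start + k]
        row = []
        for f in filters:
            act = sum(f[i][c] * window[i][c]
                      for i in range(k) for c in range(len(window[i])))
            row.append(max(0.0, act))
        out.append(row)
    return out


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Score each position against a query vector and return the weighted average."""
    scores = [sum(q * f for q, f in zip(query, row)) for row in features]
    weights = softmax(scores)
    pooled = [sum(w * row[j] for w, row in zip(weights, features))
              for j in range(len(query))]
    return pooled, weights


def random_params(n_filters=4, kernel=5, seed=0):
    """Untrained random weights (hypothetical sizes), used only to run the sketch."""
    rng = random.Random(seed)
    return {
        "filters": [[[rng.uniform(-1, 1) for _ in SS_STATES] for _ in range(kernel)]
                    for _ in range(n_filters)],
        "query": [rng.uniform(-1, 1) for _ in range(n_filters)],
        "dense": [[rng.uniform(-1, 1) for _ in range(n_filters)] for _ in FOLDS],
    }


def predict(ss, params):
    """Return (fold probabilities, per-window attention weights) for one sequence."""
    features = conv1d_relu(one_hot(ss), params["filters"])
    pooled, weights = attention_pool(features, params["query"])
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in params["dense"]]
    return softmax(logits), weights


if __name__ == "__main__":
    ss = "CCEEEEECCHHHHHHHHCCEEEEECCHHHHHHHCC"  # made-up example string
    probs, weights = predict(ss, random_params())
    for fold, p in zip(FOLDS, probs):
        print(f"{fold}: {p:.3f}")
    top = max(range(len(weights)), key=weights.__getitem__)
    print("most attended window starts at position", top)
```

Because the attention weights sum to one across sequence windows, the highest-weighted windows indicate which secondary structure segments contributed most to a prediction. In a trained model, that is the kind of signal that could be mapped back onto a sequence to interpret a fold assignment.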

Author and article information

Contributors

Natarajan Kannan:

ORCID: http://orcid.org/0000-0002-2833-8375

nkannan@uga.edu

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2041-1723

Publication date (Electronic): 27 September 2021

Publication date PMC-release: 27 September 2021

Publication date Collection: 2021

Volume: 12

Electronic Location Identifier: 5656

Affiliations

[1] GRID grid.213876.9, ISNI 0000 0004 1936 738X, Institute of Bioinformatics, University of Georgia, Athens, GA, USA

[2] GRID grid.213876.9, ISNI 0000 0004 1936 738X, Complex Carbohydrate Research Center, University of Georgia, Athens, GA, USA

[3] GRID grid.213876.9, ISNI 0000 0004 1936 738X, Department of Computer Science, University of Georgia, Athens, GA, USA

[4] GRID grid.213876.9, ISNI 0000 0004 1936 738X, Biochemistry and Molecular Biology, University of Georgia, Athens, GA, USA

Author information

Zhongliang Zhou http://orcid.org/0000-0003-4471-6759

Kelley W. Moremen http://orcid.org/0000-0003-1768-582X

Natarajan Kannan http://orcid.org/0000-0002-2833-8375

Article

Publisher ID: 25975

DOI: 10.1038/s41467-021-25975-9

PMC ID: 8476585

PubMed ID: 34580305

SO-VID: 0a294f12-a74f-45c5-8364-35e269e652ba

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received: 5 April 2021

Date accepted: 31 August 2021

Funding

Funded by: FundRef https://doi.org/10.13039/100000009, Foundation for the National Institutes of Health (Foundation for the National Institutes of Health, Inc.)

Award ID: R01GM130915

Award Recipient: Natarajan Kannan

Custom metadata

ScienceOpen disciplines: Uncategorized

Keywords: computational models, machine learning, protein sequence analyses, protein structure predictions, software

