Improved RNA secondary structure and tertiary base-pairing prediction using evolutionary profile, mutational coupling and two-dimensional transfer learning

Abstract

Motivation

The recent discovery of numerous non-coding RNAs (long non-coding RNAs, in particular) has transformed our perception of the roles of RNAs in living organisms. Our ability to understand them, however, is hampered by our inability to solve their secondary and tertiary structures efficiently and at high resolution with existing experimental techniques. Computational prediction of RNA secondary structure, on the other hand, has recently received a much-needed improvement through deep learning on a large set of approximate data, followed by transfer learning with gold-standard base-pairing structures derived from high-resolution 3-D structures. Here, we expand this single-sequence-based learning to the use of evolutionary profiles and mutational coupling.

Results

The new method yields large improvements not only in canonical base-pairs (RNA secondary structures) but even more so in base-pairing associated with tertiary interactions, such as pseudoknots, non-canonical and lone base-pairs. In particular, it is highly accurate for RNAs with more than 1000 homologous sequences, achieving an F1-score (harmonic mean of sensitivity and precision) above 0.8 for 14 of the 16 RNAs tested. The method can also, without any modification, significantly improve base-pairing prediction by incorporating artificial but functional homologous sequences generated from deep mutational scanning. The fully automatic method (publicly available as a server and as standalone software) should provide the scientific community with a powerful new tool to capture not only the secondary structure but also tertiary base-pairing information for building three-dimensional models. It also highlights the prospect of accurately solving the base-pairing structure by using a large number of natural and/or artificial homologous sequences.
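
As an illustration of the F1-score quoted above, the following is a minimal sketch of how precision, sensitivity and their harmonic mean could be computed for a set of predicted base-pairs against a reference set. The function name `base_pair_f1`, the representation of a base-pair as a tuple of two nucleotide positions, and the toy pairs are assumptions made for this example; they are not taken from the SPOT-RNA2 code or from the paper's data.

```python
def base_pair_f1(predicted, reference):
    """Return (precision, sensitivity, F1) for two collections of base-pairs.

    Each base-pair is a tuple of two nucleotide positions (hypothetical format).
    """
    # Normalize each pair so that (i, j) and (j, i) count as the same pair.
    pred = {tuple(sorted(p)) for p in predicted}
    ref = {tuple(sorted(p)) for p in reference}

    # True positives: predicted pairs that also occur in the reference.
    tp = len(pred & ref)

    # Precision: fraction of predicted pairs that are correct.
    precision = tp / len(pred) if pred else 0.0
    # Sensitivity (recall): fraction of reference pairs that were recovered.
    sensitivity = tp / len(ref) if ref else 0.0

    if precision + sensitivity == 0:
        return precision, sensitivity, 0.0

    # F1 is the harmonic mean of precision and sensitivity.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1


# Toy example with made-up pairs (not data from the paper).
reference = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(1, 20), (2, 19), (3, 18), (5, 16), (17, 4)]
precision, sensitivity, f1 = base_pair_f1(predicted, reference)
print(f"precision={precision:.3f} sensitivity={sensitivity:.3f} F1={f1:.3f}")
```

In this toy case all four reference pairs are recovered and one of the five predicted pairs is spurious, so the sensitivity is perfect while the precision, and therefore the F1-score, is lower.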

Availability and implementation

A standalone version of SPOT-RNA2 is available at https://github.com/jaswindersingh2/SPOT-RNA2. Predictions can also be made directly at https://sparks-lab.org/server/spot-rna2/. The datasets used in this research can be downloaded from both the GitHub repository and the web server mentioned above.

Supplementary information

Supplementary data are available at Bioinformatics online.

Author and article information

Contributors

Jaswinder Singh

Tongchuan Zhang

Jaspreet Singh

Yaoqi Zhou

Journal

Title: Bioinformatics

Publisher: Oxford University Press (OUP)

ISSN (Print): 1367-4803

ISSN (Electronic): 1460-2059

Publication date Created: March 11 2021

Publication date (Electronic): March 11 2021

Affiliations

[1] Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD 4111, Australia

[2] Institute for Glycomics and School of Information and Communication Technology, Griffith University, Southport, QLD 4222, Australia

Article

DOI: 10.1093/bioinformatics/btab165

PubMed ID: 33704363

SO-VID: e073cefb-bd19-49c5-80cf-e7e3048077d3

License:

https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model
