1
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Sequence similarity governs generalizability of de novo deep learning models for RNA secondary structure prediction

      research-article
      * ,
      PLOS Computational Biology
      Public Library of Science

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Making no use of physical laws or co-evolutionary information, de novo deep learning (DL) models for RNA secondary structure prediction have achieved far superior performances than traditional algorithms. However, their statistical underpinning raises the crucial question of generalizability. We present a quantitative study of the performance and generalizability of a series of de novo DL models, with a minimal two-module architecture and no post-processing, under varied similarities between seen and unseen sequences. Our models demonstrate excellent expressive capacities and outperform existing methods on common benchmark datasets. However, model generalizability, i.e., the performance gap between the seen and unseen sets, degrades rapidly as the sequence similarity decreases. The same trends are observed from several recent DL and machine learning models. And an inverse correlation between performance and generalizability is revealed collectively across all learning-based models with wide-ranging architectures and sizes. We further quantitate how generalizability depends on sequence and structure identity scores via pairwise alignment, providing unique quantitative insights into the limitations of statistical learning. Generalizability thus poses a major hurdle for deploying de novo DL models in practice and various pathways for future advances are discussed.

          Author summary

          Learning-based de novo models of RNA secondary structures critically rely on training data to associate sequences with structures. The practical utility of such models thus hinges on not only training performances but generalizations over unseen sequences. Model generalizability, however, remains poorly understood. By delineating sequence similarity at three distinct levels, we develop a series of DL models and evaluate their performance and generalizability, as well as several current DL and machine learning models. First establishing the decisive role of sequence similarity in generalizability, we further quantitate their dependencies via pairwise sequence and structure alignment. The gained quantitative insights make valuable guidelines for deploying DL models in practice and advocate RNA secondary structure as a unique platform for developing generalizable learning-based models.

          Related collections

          Most cited references66

          • Record: found
          • Abstract: found
          • Article: found
          Is Open Access

          Highly accurate protein structure prediction with AlphaFold

          Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort 1 – 4 , the structures of around 100,000 unique proteins have been determined 5 , but this represents a small fraction of the billions of known protein sequences 6 , 7 . Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ 8 —has been an important open research problem for more than 50 years 9 . Despite recent progress 10 – 14 , existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) 15 , demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm. AlphaFold predicts protein structures with an accuracy competitive with experimental structures in the majority of cases using a novel deep learning architecture.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            BLAST+: architecture and applications

            Background Sequence similarity searching is a very important bioinformatics task. While Basic Local Alignment Search Tool (BLAST) outperforms exact methods through its use of heuristics, the speed of the current BLAST software is suboptimal for very long queries or database sequences. There are also some shortcomings in the user-interface of the current command-line applications. Results We describe features and improvements of rewritten BLAST software and introduce new command-line applications. Long query sequences are broken into chunks for processing, in some cases leading to dramatically shorter run times. For long database sequences, it is possible to retrieve only the relevant parts of the sequence, reducing CPU time and memory usage for searches of short queries against databases of contigs or chromosomes. The program can now retrieve masking information for database sequences from the BLAST databases. A new modular software library can now access subject sequence data from arbitrary data sources. We introduce several new features, including strategy files that allow a user to save and reuse their favorite set of options. The strategy files can be uploaded to and downloaded from the NCBI BLAST web site. Conclusion The new BLAST command-line applications, compared to the current BLAST tools, demonstrate substantial speed improvements for long queries as well as chromosome length database sequences. We have also improved the user interface of the command-line applications.
              Bookmark
              • Record: found
              • Abstract: found
              • Article: found
              Is Open Access

              CD-HIT: accelerated for clustering the next-generation sequencing data

              Summary: CD-HIT is a widely used program for clustering biological sequences to reduce sequence redundancy and improve the performance of other sequence analyses. In response to the rapid increase in the amount of sequencing data produced by the next-generation sequencing technologies, we have developed a new CD-HIT program accelerated with a novel parallelization strategy and some other techniques to allow efficient clustering of such datasets. Our tests demonstrated very good speedup derived from the parallelization for up to ∼24 cores and a quasi-linear speedup for up to ∼8 cores. The enhanced CD-HIT is capable of handling very large datasets in much shorter time than previous versions. Availability: http://cd-hit.org. Contact: liwz@sdsc.edu Supplementary information: Supplementary data are available at Bioinformatics online.
                Bookmark

                Author and article information

                Contributors
                Role: ConceptualizationRole: Data curationRole: Formal analysisRole: InvestigationRole: MethodologyRole: Project administrationRole: ResourcesRole: SoftwareRole: ValidationRole: VisualizationRole: Writing – original draftRole: Writing – review & editing
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Comput Biol
                plos
                PLOS Computational Biology
                Public Library of Science (San Francisco, CA USA )
                1553-734X
                1553-7358
                17 April 2023
                April 2023
                : 19
                : 4
                : e1011047
                Affiliations
                [001] Department of Physics, George Washington University, Washington DC, United States of America
                University of Rochester, UNITED STATES
                Author notes

                The author declares no conflict of interest.

                Article
                PCOMPBIOL-D-23-00003
                10.1371/journal.pcbi.1011047
                10138783
                37068100
                0600717b-c967-4401-bb60-e4cc1a40546f
                © 2023 Xiangyun Qiu

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                : 2 January 2023
                : 25 March 2023
                Page count
                Figures: 6, Tables: 0, Pages: 26
                Funding
                The author received no specific funding for this work.
                Categories
                Research Article
                Biology and life sciences
                Molecular biology
                Macromolecular structure analysis
                RNA structure
                Biology and life sciences
                Biochemistry
                Nucleic acids
                RNA
                RNA structure
                Biology and life sciences
                Biochemistry
                Nucleic acids
                RNA
                RNA sequences
                Research and Analysis Methods
                Database and Informatics Methods
                Bioinformatics
                Sequence Analysis
                Sequence Alignment
                Computer and Information Sciences
                Artificial Intelligence
                Machine Learning
                Biology and Life Sciences
                Neuroscience
                Cognitive Science
                Cognitive Psychology
                Learning
                Biology and Life Sciences
                Psychology
                Cognitive Psychology
                Learning
                Social Sciences
                Psychology
                Cognitive Psychology
                Learning
                Biology and Life Sciences
                Neuroscience
                Learning and Memory
                Learning
                Biology and life sciences
                Molecular biology
                Macromolecular structure analysis
                RNA structure
                RNA folding
                Biology and life sciences
                Biochemistry
                Nucleic acids
                RNA
                RNA structure
                RNA folding
                Biology and life sciences
                Molecular biology
                Macromolecular structure analysis
                RNA structure
                RNA alignment
                Biology and life sciences
                Biochemistry
                Nucleic acids
                RNA
                RNA structure
                RNA alignment
                Biology and life sciences
                Molecular biology
                Macromolecular structure analysis
                RNA structure
                RNA structure prediction
                Biology and life sciences
                Biochemistry
                Nucleic acids
                RNA
                RNA structure
                RNA structure prediction
                Custom metadata
                vor-update-to-uncorrected-proof
                2023-04-27
                All code written in support of this publication is publicly available at https://github.com/qiuresearch/SeqFold2D. All datasets used are from publically available databases as described in the main text.

                Quantitative & Systems biology
                Quantitative & Systems biology

                Comments

                Comment on this article