      Is Open Access

      iPro-WAEL: a comprehensive and robust framework for identifying promoters in multiple species

      research-article
      Nucleic Acids Research
      Oxford University Press


          Abstract

          Promoters are consensus DNA sequences located near transcription start sites, and they play an important role in transcription initiation. Because of this role, identifying promoters is important for characterizing gene expression. Numerous computational methods have been proposed to predict promoters, but they struggle to achieve satisfactory performance across multiple species. In this study, we propose a novel weighted average ensemble learning model, termed iPro-WAEL, for identifying promoters in multiple species, including human, mouse, E. coli, Arabidopsis, B. amyloliquefaciens, B. subtilis and R. capsulatus. Extensive benchmarking experiments show that iPro-WAEL outperforms current methods in promoter prediction. The experimental results also demonstrate satisfactory prediction ability of iPro-WAEL across cell lines, on promoters annotated by other methods, and in distinguishing promoters from enhancers. Moreover, we identify the most important transcription factor binding site (TFBS) motif in promoter regions to facilitate the study of important motifs in these regions. The source code of iPro-WAEL is freely available at https://github.com/HaoWuLab-Bioinformatics/iPro-WAEL.
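
The abstract describes iPro-WAEL as a weighted average ensemble but does not give the base models or how the weights are chosen. As a rough illustration of the general technique only, and not the authors' implementation, the sketch below combines per-model promoter probabilities with a normalized weighted average. The function names, weights, probabilities, and the 0.5 threshold are all hypothetical.

```python
from typing import List, Sequence


def weighted_average_ensemble(
    probabilities: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Combine per-model probabilities into one score per sequence.

    probabilities[m][i] is model m's predicted probability that sequence i
    is a promoter; weights[m] is the (hypothetical) weight given to model m.
    """
    if not probabilities or len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n = len(probabilities[0])
    if any(len(p) != n for p in probabilities):
        raise ValueError("all models must score the same sequences")
    # Weighted mean over models, normalized so the result stays in [0, 1].
    return [
        sum(w * p[i] for w, p in zip(weights, probabilities)) / total
        for i in range(n)
    ]


def classify(scores: Sequence[float], threshold: float = 0.5) -> List[bool]:
    """Label a sequence as a promoter when its ensemble score meets the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical example: two models scoring two sequences, weighted 3:1.
scores = weighted_average_ensemble([[0.9, 0.2], [0.5, 0.4]], [3.0, 1.0])
labels = classify(scores)
```

In practice the weights would be tuned on validation data; how iPro-WAEL selects them is not stated in this record.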

                Author and article information

                Contributors
                Journal
                Nucleic Acids Research (Nucleic Acids Res)
                Oxford University Press
                ISSN: 0305-1048 (print), 1362-4962 (electronic)
                14 October 2022
                26 September 2022
                Volume: 50
                Issue: 18
                Pages: 10278-10289
                Affiliations
                School of Software, Shandong University, Jinan, 250101, Shandong, China
                College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China
                Author notes
                To whom correspondence should be addressed. Tel: +86 18254105536; Fax: +86 053188391686; Email: haowu@sdu.edu.cn
                Author information
                https://orcid.org/0000-0001-8696-4983
                https://orcid.org/0000-0003-4605-8577
                https://orcid.org/0000-0003-2340-9258
                Article
                gkac824
                10.1093/nar/gkac824
                9561371
                36161334
                78ac7f8e-f716-4eb8-a81d-09ec51238119
                © The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Accepted: 14 September 2022
                Revised: 24 August 2022
                Received: 18 July 2022
                Page count
                Pages: 12
                Funding
                Funded by: National Natural Science Foundation of China, DOI 10.13039/501100001809;
                Award ID: 62272278
                Award ID: 61972322
                Funded by: National Key Research and Development Program, DOI 10.13039/501100012166;
                Award ID: 2021YFF0704103
                Funded by: Natural Science Foundation of Shaanxi Province, DOI 10.13039/501100007128;
                Award ID: 2021JM110
                Funded by: Shandong University, DOI 10.13039/100009108;
                Categories
                AcademicSubjects/SCI00010
                Narese/7
                Narese/24
                Computational Biology

                Genetics
