      A practical guide to methods controlling false discoveries in computational biology


          Abstract

          Background

          In high-throughput studies, hundreds to millions of hypotheses are typically tested. Statistical methods that control the false discovery rate (FDR) have emerged as popular and powerful tools for error rate control. While classic FDR methods use only p values as input, more modern FDR methods have been shown to increase power by incorporating complementary information as informative covariates to prioritize, weight, and group hypotheses. However, there is currently no consensus on how the modern methods compare to one another. We investigate the accuracy, applicability, and ease of use of two classic and six modern FDR-controlling methods by performing a systematic benchmark comparison using simulation studies as well as six case studies in computational biology.
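          As a concrete reference point for a classic method that uses only p values as input, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not name the two classic methods that were benchmarked, so treating this procedure as representative is an assumption; the function name `benjamini_hochberg` and its arguments are illustrative and are not taken from the article's code.

```python
import numpy as np


def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean array marking hypotheses rejected at FDR level alpha.

    Illustrative sketch of the Benjamini-Hochberg step-up procedure;
    the name and signature are hypothetical, not the article's API.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    if m == 0:
        return np.zeros(0, dtype=bool)

    # Sort p values and compare the i-th smallest to alpha * i / m.
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds

    rejected = np.zeros(m, dtype=bool)
    if below.any():
        # Step-up rule: find the largest rank that meets its threshold
        # and reject every hypothesis at or below that rank.
        k = np.nonzero(below)[0].max()
        rejected[order[: k + 1]] = True
    return rejected


if __name__ == "__main__":
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
    print(benjamini_hochberg(example, alpha=0.05))
```

          Every hypothesis is treated identically here: the only input is the p value, which is the limitation that covariate-aware methods try to address.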

          Results

          Methods that incorporate informative covariates are modestly more powerful than classic approaches, and do not underperform classic approaches, even when the covariate is completely uninformative. The majority of methods are successful at controlling the FDR, with the exception of two modern methods under certain settings. Furthermore, we find that the improvement of the modern FDR methods over the classic methods increases with the informativeness of the covariate, total number of hypothesis tests, and proportion of truly non-null hypotheses.
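          One simple way to see how an informative covariate can add power is a weighted step-up procedure, in which each p value is divided by a covariate-derived weight before the usual step-up rule is applied. The sketch below is a generic illustration and is not the implementation of any of the six modern methods in the benchmark, which the abstract does not name. The function name `weighted_bh` is hypothetical, and the weights are assumed to be non-negative and fixed in advance from the covariate alone, not from the p values.

```python
import numpy as np


def weighted_bh(pvals, weights, alpha=0.05):
    """Weighted step-up procedure: apply the step-up rule to p / w.

    Hypothetical helper for illustration only. Weights are rescaled to
    average 1, so uniform weights reduce to the unweighted procedure.
    """
    p = np.asarray(pvals, dtype=float)
    w = np.asarray(weights, dtype=float)
    if p.shape != w.shape:
        raise ValueError("pvals and weights must have the same shape")
    if p.size == 0:
        return np.zeros(0, dtype=bool)
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Rescale so the weights average to 1 (a fixed total weight budget).
    w = w * (w.size / w.sum())

    # Hypotheses with zero weight can never be rejected.
    q = np.full(p.shape, np.inf)
    positive = w > 0
    q[positive] = p[positive] / w[positive]

    # Standard step-up rule on the weighted p values.
    m = q.size
    order = np.argsort(q)
    below = q[order] <= alpha * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        rejected[order[: k + 1]] = True
    return rejected


if __name__ == "__main__":
    p = [0.03, 0.03, 0.8, 0.9]
    print(weighted_bh(p, [1, 1, 1, 1]))          # uninformative weights
    print(weighted_bh(p, [1.8, 1.8, 0.2, 0.2]))  # covariate favors the first two
```

          With uniform weights the result matches the unweighted procedure, which mirrors the finding that an uninformative covariate need not cost power; weights that concentrate on hypotheses more likely to be non-null can turn borderline p values into discoveries.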

          Conclusions

          Modern FDR methods that use an informative covariate provide advantages over classic FDR-controlling procedures, with the relative gain dependent on the application and informativeness of available covariates. We present our findings as a practical guide and provide recommendations to aid researchers in their choice of methods to correct for false discoveries.

          Electronic supplementary material

          The online version of this article (10.1186/s13059-019-1716-1) contains supplementary material, which is available to authorized users.


                Author and article information

                Contributors
                keegan@jimmy.harvard.edu
                patrick.kimes@gmail.com
                duvallet@mit.edu
                alejandro.reyes.ds@gmail.com
                ayshwaryasubramanian@gmail.com
                tengmx@gmail.com
                chinmay21191@gmail.com
                ejalm@mit.edu
                shicks19@jhu.edu
                Journal
                Genome Biology (Genome Biol)
                BioMed Central (London)
                ISSN: 1474-7596, 1474-760X
                Published: 4 June 2019
                Volume 20, article 118
                Affiliations
                [1] ISNI 0000 0001 2106 9910, GRID grid.65499.37, Department of Data Sciences, Dana-Farber Cancer Institute, 450 Brookline Avenue, Boston, 02215 USA
                [2] ISNI 000000041936754X, GRID grid.38142.3c, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, 02215 USA
                [3] ISNI 0000 0001 2341 2786, GRID grid.116068.8, Department of Biological Engineering, MIT, 77 Massachusetts Avenue, Cambridge, USA
                [4] ISNI 0000 0001 2341 2786, GRID grid.116068.8, Center for Microbiome Informatics and Therapeutics, MIT, 77 Massachusetts Avenue, Cambridge, USA
                [5] GRID grid.66859.34, Broad Institute, 415 Main Street, Cambridge, USA
                [6] ISNI 0000 0000 9891 5233, GRID grid.468198.a, Department of Biostatistics & Bioinformatics, Moffitt Cancer Center, 12902 Magnolia Drive, Tampa, 33612 USA
                [7] ISNI 000000041936754X, GRID grid.38142.3c, Biological and Biomedical Sciences Program, Harvard University, Boston, USA
                [8] ISNI 0000 0001 2171 9311, GRID grid.21107.35, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe Street, Baltimore, 21205 USA
                Author information
                http://orcid.org/0000-0002-7858-0231
                Article
                1716
                10.1186/s13059-019-1716-1
                6547503
                31164141
                799e228e-4d11-4732-a6d7-47c90684897b
                © The Author(s) 2019

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                History
                Received: 2 November 2018
                Accepted: 10 May 2019
                Funding
                Funded by: FundRef http://dx.doi.org/10.13039/100000051, National Human Genome Research Institute;
                Award ID: U41HG004059
                Funded by: FundRef http://dx.doi.org/10.13039/100000051, National Human Genome Research Institute;
                Award ID: R02HG005220
                Funded by: FundRef http://dx.doi.org/10.13039/100000057, National Institute of General Medical Sciences;
                Award ID: R01GM083084
                Funded by: National Institute of General Medical Sciences
                Award ID: R01GM103552
                Funded by: FundRef http://dx.doi.org/10.13039/100000051, National Human Genome Research Institute;
                Award ID: R00HG009007
                Funded by: Chan Zuckerberg Initiative DAF
                Award ID: 2018-183142
                Funded by: Chan Zuckerberg Initiative DAF
                Award ID: 2018-183560
                Funded by: Broadnext10
                Award ID: Broadnext10
                Funded by: FundRef http://dx.doi.org/10.13039/100000054, National Cancer Institute;
                Award ID: P30CA076292
                Funded by: FundRef http://dx.doi.org/10.13039/100006132, Office of Science;
                Award ID: DE-AC02-05CH11231
                Categories
                Research

                Genetics
                Keywords: multiple hypothesis testing, false discovery rate, RNA-seq, scRNA-seq, ChIP-seq, microbiome, GWAS, gene set analysis
