
      Sustainable data analysis with Snakemake

      methods-article


          Abstract

          Data analysis often entails a multitude of heterogeneous steps, from the application of various command line tools to the usage of scripting languages like R or Python for the generation of plots and tables. It is widely recognized that data analyses should ideally be conducted in a reproducible way. Reproducibility enables technical validation and regeneration of results on the original or even new data. However, reproducibility alone is by no means sufficient to deliver an analysis that is of lasting impact (i.e., sustainable) for the field, or even just one research group. We postulate that it is equally important to ensure adaptability and transparency. The former describes the ability to modify the analysis to answer extended or slightly different research questions. The latter describes the ability to understand the analysis in order to judge whether it is not only technically, but methodologically valid.

          Here, we analyze the properties needed for a data analysis to become reproducible, adaptable, and transparent. We show how the popular workflow management system Snakemake can be used to guarantee this, and how it enables an ergonomic, combined, unified representation of all steps involved in data analysis, ranging from raw data processing, to quality control and fine-grained, interactive exploration and plotting of final results.


                Author and article information

                Contributors
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software
                Roles: Methodology, Software
                Roles: Methodology, Software
                Roles: Methodology, Software
                Roles: Methodology, Software, Writing – Review & Editing
                Roles: Methodology, Software
                Roles: Supervision, Writing – Review & Editing
                Roles: Conceptualization
                Roles: Conceptualization, Data Curation, Formal Analysis, Funding Acquisition, Investigation, Methodology, Project Administration, Resources, Software, Supervision, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing
                Journal
                F1000Research (F1000Res)
                F1000 Research Limited (London, UK)
                ISSN: 2046-1402
                19 April 2021
                2021; 10: 33
                Affiliations
                [1] Algorithms for reproducible bioinformatics, Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
                [2] Institute of Pathology, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
                [3] Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
                [4] Swiss Institute of Bioinformatics (SIB), Basel, Switzerland
                [5] EMBL-EBI, Hinxton, UK
                [6] Broad Institute of MIT and Harvard, Cambridge, USA
                [7] Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, USA
                [8] Stanford University Research Computing Center, Stanford University, Stanford, USA
                [9] German Cancer Consortium (DKTK, partner site Essen) and German Cancer Research Center, DKFZ, Heidelberg, Germany
                [10] Biomedical Informatics, Harvard Medical School, Harvard University, Boston, USA
                [11] Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health (BIH), Center for Digital Health, Berlin, Germany
                [12] Biozentrum, University of Basel, Basel, Switzerland
                [13] SIB Swiss Institute of Bioinformatics / ELIXIR Switzerland, Lausanne, Switzerland
                [14] Microsoft Singapore, Singapore, Singapore
                [15] CUBI – Core Unit Bioinformatics, Berlin Institute of Health, Berlin, Germany
                [16] Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, Germany
                [17] Quantitative Biology Center (QBiC), University of Tübingen, Tübingen, Germany
                [18] Medical Oncology, Harvard Medical School, Harvard University, Boston, USA
                [1] Institute of Informatics, LMU Munich, Munich, Germany
                University of Duisburg-Essen, Germany
                [1] Department of Medicine, University of California, San Diego, San Diego, CA, USA
                Author notes

                No competing interests were disclosed.


                Author information
                https://orcid.org/0000-0002-3976-9701
                https://orcid.org/0000-0002-4166-4343
                https://orcid.org/0000-0002-8921-6005
                https://orcid.org/0000-0003-3683-6208
                https://orcid.org/0000-0002-9114-6421
                https://orcid.org/0000-0002-4387-3819
                https://orcid.org/0000-0002-3594-6213
                https://orcid.org/0000-0002-3468-0652
                https://orcid.org/0000-0001-9818-9320
                Article
                DOI: 10.12688/f1000research.29032.2
                PMC ID: 8114187
                PubMed ID: 34035898
                e3980468-5c69-4b39-a02a-059dbea3ab4d
                Copyright: © 2021 Mölder F et al.

                This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                8 April 2021
                Funding
                Funded by: United States National Science Foundation Graduate Research Fellowship Program (NSF-GRFP)
                Award ID: 1745303
                Funded by: Google LLC
                Funded by: Netherlands Organisation for Scientific Research (NWO)
                Award ID: VENI grant 016.Veni.173.076
                Funded by: German Research Foundation
                Award ID: SFB 876
                This work was supported by the Netherlands Organisation for Scientific Research (NWO) (VENI grant 016.Veni.173.076, Johannes Köster), the German Research Foundation (SFB 876, Johannes Köster and Sven Rahmann), the United States National Science Foundation Graduate Research Fellowship Program (NSF-GRFP) (Grant No. 1745303, Christopher Tomkins-Tinch), and Google LLC (Vanessa Sochat and Johannes Köster).
                The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Method Article
                Articles

                data analysis, workflow management, sustainability, reproducibility, transparency, adaptability, scalability
