      scPipe: A flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data

      research-article


          Abstract

          Single-cell RNA sequencing (scRNA-seq) technology allows researchers to profile the transcriptomes of thousands of cells simultaneously. Protocols that incorporate both designed and random barcodes have greatly increased the throughput of scRNA-seq, but give rise to a more complex data structure. New tools are needed that can handle the various barcoding strategies used by different protocols, exploit this information for quality assessment at the sample level, and provide effective visualization of these results in preparation for higher-level analyses. To this end, we developed scPipe, an R/Bioconductor package that integrates barcode demultiplexing, read alignment, UMI-aware gene-level quantification and quality control of raw sequencing data generated by multiple protocols, including CEL-seq, MARS-seq, Chromium 10X, Drop-seq and Smart-seq. scPipe produces a count matrix that is essential for downstream analysis, along with an HTML report that summarises data quality. These results can be used as input for downstream analyses including normalization, visualization and statistical testing. scPipe performs this processing in a few simple R commands, promoting reproducible analysis of single-cell data that is compatible with the emerging suite of open-source scRNA-seq analysis tools available in R/Bioconductor and beyond. The scPipe R package is available for download from https://www.bioconductor.org/packages/scPipe.
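          To make "barcode demultiplexing" and "UMI-aware gene-level quantification" concrete, the following is a minimal Python sketch of the general idea only. It is not scPipe's implementation or API. The read layout (a 6-base UMI followed by an 8-base cell barcode), the function names and the toy records are all assumptions made for illustration, and UMIs are collapsed by exact match only.

```python
from collections import defaultdict

# Hypothetical read layout: 6-base UMI followed by an 8-base cell barcode.
# Real protocols differ in both positions and lengths.
def split_barcode_read(seq, umi_len=6, barcode_len=8):
    """Split a barcode read into (umi, cell_barcode)."""
    umi = seq[:umi_len]
    barcode = seq[umi_len:umi_len + barcode_len]
    return umi, barcode

def umi_count_matrix(records, known_barcodes):
    """Count distinct UMIs per (gene, cell barcode).

    records: iterable of (barcode_read_sequence, gene_id) pairs, where
    gene_id is the gene the paired transcript read was assigned to.
    Reads whose barcode is not in known_barcodes are discarded.
    """
    umis = defaultdict(set)
    for seq, gene in records:
        umi, barcode = split_barcode_read(seq)
        if barcode not in known_barcodes:
            continue  # demultiplexing: drop reads with unknown barcodes
        umis[(gene, barcode)].add(umi)  # duplicates of a UMI collapse here
    return {key: len(umi_set) for key, umi_set in umis.items()}

# Toy input (made up for this example).
records = [
    ("AAAAAA" + "ACGTACGT", "GeneA"),
    ("AAAAAA" + "ACGTACGT", "GeneA"),  # same UMI: counted once
    ("CCCCCC" + "ACGTACGT", "GeneA"),
    ("AAAAAA" + "TTTTGGGG", "GeneB"),
    ("GGGGGG" + "NNNNNNNN", "GeneA"),  # unknown barcode: discarded
]
counts = umi_count_matrix(records, {"ACGTACGT", "TTTTGGGG"})
```

          Counting distinct UMIs rather than reads means that amplification duplicates of the same molecule contribute a single count to each gene and cell.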

          Author summary

          Biotechnologies that allow researchers to measure gene activity in individual cells are growing in popularity. This has resulted in an avalanche of custom analysis methods designed to deal with the complex data that arise from this technology. Although hundreds of analysis methods are available, relatively few deal with raw data processing in a holistic way. Our scPipe software has been developed to fill this gap. scPipe is the first fully integrated R package that deals with the raw sequencing reads from single-cell gene expression studies, processing them to the point where biologically interesting downstream analyses can take place. By following community-developed standards, scPipe is compatible with many other software packages for single-cell analysis available from the open-source Bioconductor project, facilitating a complete beginning-to-end analysis of single-cell gene expression data. This allows various biological questions to be answered, ranging from the identification of novel cell types to the discovery of new marker genes. scPipe promotes reproducibility and makes it easier for researchers to share results and code.

          Most cited references (12)

          featureCounts: An efficient general-purpose program for assigning sequence reads to genomic features

          (2013)
          Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a great variety of genomic analyses but has so far received relatively little attention in the literature. We present featureCounts, a read summarization program suitable for counting reads generated from either RNA or genomic DNA sequencing experiments. featureCounts implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single or paired-end reads and provides a wide range of options appropriate for different sequencing applications. featureCounts is available under GNU General Public License as part of the Subread (http://subread.sourceforge.net) or Rsubread (http://www.bioconductor.org) software packages.
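          Read summarization as described above can be illustrated with a deliberately naive Python sketch. It only shows what is being counted. It does not reproduce featureCounts' chromosome hashing or feature blocking, and the function name, the half-open coordinate convention and the rule of skipping reads that overlap more than one feature are assumptions chosen for the example.

```python
def summarize_reads(reads, features):
    """Count reads per feature by interval overlap (naive scan).

    reads:    list of (chrom, start, end), half-open coordinates
    features: dict name -> (chrom, start, end), half-open coordinates
    Reads overlapping no feature, or more than one, are left uncounted.
    """
    counts = {name: 0 for name in features}
    for chrom, start, end in reads:
        hits = [
            name
            for name, (f_chrom, f_start, f_end) in features.items()
            if f_chrom == chrom and start < f_end and f_start < end
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

# Toy annotation and reads (made up for this example).
features = {
    "geneA": ("chr1", 100, 200),
    "geneB": ("chr1", 150, 300),
    "geneC": ("chr2", 0, 50),
}
reads = [
    ("chr1", 100, 120),  # geneA only
    ("chr1", 160, 180),  # overlaps geneA and geneB: ambiguous, skipped
    ("chr1", 250, 280),  # geneB only
    ("chr2", 10, 20),    # geneC only
    ("chr2", 100, 120),  # no feature
]
gene_counts = summarize_reads(reads, features)
```

          The scan is linear in the number of features for every read, which is the cost that indexing schemes such as those named in the abstract are designed to avoid.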

            A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor

            Single-cell RNA sequencing (scRNA-seq) is widely used to profile the transcriptome of individual cells. This provides biological resolution that cannot be matched by bulk RNA sequencing, at the cost of increased technical noise and data complexity. The differences between scRNA-seq and bulk RNA-seq data mean that the analysis of the former cannot be performed by recycling bioinformatics pipelines for the latter. Rather, dedicated single-cell methods are required at various steps to exploit the cellular resolution while accounting for technical noise. This article describes a computational workflow for low-level analyses of scRNA-seq data, based primarily on software packages from the open-source Bioconductor project. It covers basic steps including quality control, data exploration and normalization, as well as more complex procedures such as cell cycle phase assignment, identification of highly variable and correlated genes, clustering into subpopulations and marker gene detection. Analyses were demonstrated on gene-level count data from several publicly available datasets involving haematopoietic stem cells, brain-derived cells, T-helper cells and mouse embryonic stem cells. This will provide a range of usage scenarios from which readers can construct their own analysis pipelines.

              A general and flexible method for signal extraction from single-cell RNA-seq data

              Single-cell RNA-sequencing (scRNA-seq) is a powerful high-throughput technique that enables researchers to measure genome-wide transcription levels at the resolution of single cells. Because of the low amount of RNA present in a single cell, some genes may fail to be detected even though they are expressed; these genes are usually referred to as dropouts. Here, we present a general and flexible zero-inflated negative binomial model (ZINB-WaVE), which leads to low-dimensional representations of the data that account for zero inflation (dropouts), over-dispersion, and the count nature of the data. We demonstrate, with simulated and real data, that the model and its associated estimation procedure are able to give a more stable and accurate low-dimensional representation of the data than principal component analysis (PCA) and zero-inflated factor analysis (ZIFA), without the need for a preliminary normalization step.
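              For readers unfamiliar with the distribution named above, a zero-inflated negative binomial mixes a point mass at zero (the dropout component) with a negative binomial count distribution. The Python sketch below evaluates that probability mass function for a single observation under the mean/dispersion parameterization. It is not the ZINB-WaVE estimation procedure, and the function and parameter names (mu, theta, pi) are assumptions chosen for the example.

```python
import math

def nb_pmf(y, mu, theta):
    """Negative binomial pmf with mean mu > 0 and dispersion theta > 0."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(y + 1) - math.lgamma(theta)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (mu + theta))
    )
    return math.exp(log_p)

def zinb_pmf(y, mu, theta, pi):
    """Zero-inflated NB: with probability pi the count is a structural zero."""
    point_mass = 1.0 if y == 0 else 0.0
    return pi * point_mass + (1.0 - pi) * nb_pmf(y, mu, theta)

# Zero inflation raises the probability of observing a zero count.
p_zero_nb = nb_pmf(0, mu=2.0, theta=1.5)
p_zero_zinb = zinb_pmf(0, mu=2.0, theta=1.5, pi=0.3)
```

              Setting pi to 0 recovers the ordinary negative binomial, so the extra parameter only adds probability at zero.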

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Investigation, Resources, Software, Writing – original draft
                Roles: Data curation, Software
                Roles: Software, Visualization
                Roles: Data curation, Resources
                Roles: Data curation, Resources
                Roles: Data curation, Resources
                Roles: Data curation, Resources
                Roles: Conceptualization, Data curation, Funding acquisition, Methodology, Resources, Supervision
                Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLoS Computational Biology (PLoS Comput Biol)
                Publisher: Public Library of Science (San Francisco, CA, USA)
                ISSN: 1553-734X, 1553-7358
                Published: 10 August 2018
                Volume 14, Issue 8, e1006361
                Affiliations
                [1] Molecular Medicine Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
                [2] Department of Medical Biology, The University of Melbourne, Parkville, Australia
                [3] College of Life Science, Zhejiang University, Hangzhou, Zhejiang Province, P.R. China
                [4] Australian Genome Research Facility, Parkville, Australia
                [5] School of Mathematics and Statistics, The University of Melbourne, Parkville, Australia
                Johns Hopkins University, UNITED STATES
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                http://orcid.org/0000-0003-3420-3685
                http://orcid.org/0000-0002-7383-0609
                Article
                Manuscript ID: PCOMPBIOL-D-18-00398
                DOI: 10.1371/journal.pcbi.1006361
                PMCID: PMC6105007
                PMID: 30096152
                2da215bd-921a-4f1e-85a6-d2903e2b6993
                © 2018 Tian et al

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 11 March 2018
                Accepted: 12 July 2018
                Page count
                Figures: 6, Tables: 1, Pages: 15
                Funding
                Funded by: National Health and Medical Research Council (funder ID: http://dx.doi.org/10.13039/501100000925)
                Award ID: GNT1143163, GNT1104924
                Funded by: National Health and Medical Research Council (funder ID: http://dx.doi.org/10.13039/501100000925)
                Award ID: GNT1124812, GNT1062820
                Funded by: University of Melbourne (funder ID: http://dx.doi.org/10.13039/501100001782)
                Award ID: Melbourne Research Scholarship
                Funded by: Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation
                Award ID: 2018-182819
                This work was supported by the National Health and Medical Research Council (NHMRC) Project Grants (GNT1143163 to MER, GNT1124812 to SHN and MER, GNT1062820 to SHN), Fellowship GNT1104924 to MER, the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation (grant number 2018-182819 to MER), a Melbourne Research Scholarship to LT, Genomics Innovation Hub, Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIISS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Computer and Information Sciences > Software Engineering > Preprocessing
                Engineering and Technology > Software Engineering > Preprocessing
                Research and Analysis Methods > Database and Informatics Methods > Bioinformatics > Sequence Analysis > Sequence Alignment
                Engineering and Technology > Industrial Engineering > Quality Control
                Computer and Information Sciences > Information Technology > Data Processing
                Biology and Life Sciences > Genetics > Gene Expression
                Biology and Life Sciences > Molecular Biology > Molecular Biology Techniques > Gene Mapping > Exon Mapping
                Research and Analysis Methods > Molecular Biology Techniques > Gene Mapping > Exon Mapping
                Physical Sciences > Chemistry > Chemical Elements > Chromium
                Computer and Information Sciences > Software Engineering > Software Tools
                Engineering and Technology > Software Engineering > Software Tools
                Custom metadata
                vor-update-to-uncorrected-proof
                2018-08-22
                Datasets are available under GEO accession numbers GSE109999 and GSE111108 or from the Human Cell Atlas Preview Datasets webpage (https://preview.data.humancellatlas.org/).

                Quantitative & Systems biology