      Is Open Access

      A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification

      research-article


          Abstract

          Motivation

          Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data and the strong 3’ sampling bias make it difficult to disambiguate cases where no read maps uniquely to any of the candidate target genes.

          Results

          We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data using the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative, empirical priors that we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e., when there is no unique intra-cellular information). We show that our new model improves the per-cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups.
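
          The effect of an informative prior on multi-mapping resolution can be illustrated with a toy sketch. This is not alevin's implementation: the function names, the simple EM-style loop with additive pseudo-counts, and the use of a per-gene mean over similar cells as the prior are all assumptions made for illustration. It shows only that, when a set of reads is ambiguous between two genes and no read maps uniquely, a flat prior splits the counts evenly, whereas a prior borrowed from similar cells shifts the allocation.

```python
from typing import Dict, FrozenSet, List, Optional


def prior_from_similar_cells(neighbours: List[Dict[str, float]],
                             genes: List[str]) -> Dict[str, float]:
    """Hypothetical helper: per-gene mean count over similar cells."""
    return {g: sum(c.get(g, 0.0) for c in neighbours) / len(neighbours)
            for g in genes}


def resolve_multimapping(eq_classes: Dict[FrozenSet[str], int],
                         genes: List[str],
                         prior: Optional[Dict[str, float]] = None,
                         n_iter: int = 200) -> Dict[str, float]:
    """Toy EM-style allocation of ambiguous counts across genes.

    eq_classes maps a set of candidate genes to the number of reads
    (or UMIs) compatible with exactly that set. Prior pseudo-counts are
    added to the current estimates before each reallocation step.
    """
    prior = prior or {}
    # Small flat pseudo-count for genes without an informative prior.
    alpha = {g: prior.get(g, 1e-2) for g in genes}
    est = {g: 1.0 for g in genes}
    for _ in range(n_iter):
        new = {g: 0.0 for g in genes}
        for gene_set, count in eq_classes.items():
            weights = {g: est[g] + alpha[g] for g in gene_set}
            total = sum(weights.values())
            for g in gene_set:
                # Split the class count in proportion to the weights.
                new[g] += count * weights[g] / total
        est = new
    return est


genes = ["A", "B"]
# Ten reads compatible with both genes, none mapping uniquely.
eq_classes = {frozenset({"A", "B"}): 10}

flat = resolve_multimapping(eq_classes, genes)

# Assumed estimates from two cells with similar expression patterns.
neighbours = [{"A": 8.0, "B": 2.0}, {"A": 10.0, "B": 0.0}]
informed = resolve_multimapping(
    eq_classes, genes, prior=prior_from_similar_cells(neighbours, genes))

print(flat)
print(informed)
```

          With the flat prior the ten ambiguous counts are divided equally between the two genes; with the prior derived from the similar cells the allocation moves toward gene A while the total stays at ten.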

          Availability and implementation

          The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.


                Author and article information

                Journal
                Bioinformatics
                Oxford University Press
                1367-4803
                1367-4811
                July 2020
                13 July 2020
                Volume: 36
                Issue: Suppl 1, ISMB 2020 Proceedings
                Pages: i292-i299
                Affiliations
                [b1] Department of Computer Science, Stony Brook University, Stony Brook 11794, NY, USA
                [b2] Computer Science Department, University of Maryland, College Park 20742, MD, USA
                Author notes
                To whom correspondence should be addressed. E-mail: asrivastava@cs.stonybrook.edu or rob@cs.umd.edu
                Article
                btaa450
                10.1093/bioinformatics/btaa450
                7355277
                32657394
                4a56ffbc-e47b-4a15-b5e0-904e1b67b752
                © The Author(s) 2020. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License ( http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

                History
                Page count
                Pages: 8
                Funding
                Funded by: National Institutes of Health, DOI 10.13039/100000002;
                Award ID: R01 HG009937
                Funded by: NSF, DOI 10.13039/100000001;
                Award ID: CCF-1750472
                Award ID: CNS-1763680
                Categories
                Macromolecular Sequence, Structure, and Function

                Bioinformatics & Computational biology
