A Bayesian framework for inter-cellular information sharing improves dscRNA-seq quantification

Abstract

Motivation

Droplet-based single-cell RNA-seq (dscRNA-seq) data are being generated at an unprecedented pace, and the accurate estimation of gene-level abundances for each cell is a crucial first step in most dscRNA-seq analyses. When pre-processing the raw dscRNA-seq data to generate a count matrix, care must be taken to account for the potentially large number of multi-mapping locations per read. The sparsity of dscRNA-seq data and the strong 3’ sampling bias make it difficult to disambiguate cases where no read maps uniquely to any of the candidate target genes.

Results

We introduce a Bayesian framework for information sharing across cells within a sample, or across multiple modalities of data from the same sample, to improve gene quantification estimates for dscRNA-seq data. We use an anchor-based approach to connect cells with similar gene-expression patterns, and learn informative empirical priors, which we provide to alevin’s gene multi-mapping resolution algorithm. This improves the quantification estimates for genes with no uniquely mapping reads (i.e. when there is no unique intra-cellular information). We show that our new model improves the per-cell gene-level estimates and provides a principled framework for information sharing across multiple modalities. We test our method on a combination of simulated and real datasets under various setups.
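To illustrate why a prior matters when a gene has no uniquely mapping reads, the following is a minimal sketch and not alevin's implementation. It assumes a simplified EM-style update in which per-gene prior pseudo-counts are added to the current estimates before ambiguous reads are re-apportioned. The function name `resolve_multimapping`, the equivalence-class representation, and the pseudo-count values are hypothetical and chosen only for this example.

```python
def resolve_multimapping(eq_classes, n_genes, prior=None, n_iter=200):
    """Estimate per-gene counts for a single cell (illustrative sketch).

    eq_classes: list of (gene_ids, count) pairs, where gene_ids is a tuple
        of the genes a group of reads/UMIs is compatible with.
    prior: optional per-gene pseudo-counts; a hypothetical stand-in for an
        empirical prior learned from cells with similar expression.
    """
    if prior is None:
        prior = [0.0] * n_genes  # uninformative: no pseudo-counts
    total = sum(count for _, count in eq_classes)
    # Start from a uniform allocation of all reads across genes.
    est = [total / n_genes] * n_genes
    for _ in range(n_iter):
        # Combine the current estimates with the prior pseudo-counts.
        weights = [e + p for e, p in zip(est, prior)]
        new_est = [0.0] * n_genes
        for genes, count in eq_classes:
            denom = sum(weights[g] for g in genes)
            for g in genes:
                # Split the class count in proportion to the weights;
                # fall back to an even split if all weights are zero.
                share = weights[g] / denom if denom > 0 else 1.0 / len(genes)
                new_est[g] += count * share
        est = new_est
    return est


if __name__ == "__main__":
    # Ten reads compatible with both gene 0 and gene 1, none unique.
    eq = [((0, 1), 10)]
    print(resolve_multimapping(eq, n_genes=2))
    print(resolve_multimapping(eq, n_genes=2, prior=[9.0, 1.0]))
```

Without a prior, the ten ambiguous reads stay evenly split, because nothing in the cell distinguishes the two genes. With pseudo-counts favouring gene 0, the allocation moves toward a 9:1 split while the total count is preserved.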

Availability and implementation

The information sharing model is included in alevin and is implemented in C++14. It is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/salmon as of version 1.1.0.

Author and article information

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): July 2020

Publication date (Electronic): 13 July 2020

Publication date PMC-release: 13 July 2020

Volume: 36

Issue: Suppl 1, ISMB 2020 Proceedings

Pages: i292-i299

Affiliations

[b1] Department of Computer Science, Stony Brook University, Stony Brook 11794, NY, USA

[b2] Computer Science Department, University of Maryland, College Park 20742, MD, USA

Author notes

To whom correspondence should be addressed. E-mail: asrivastava@cs.stonybrook.edu or rob@cs.umd.edu

Article

Publisher ID: btaa450

DOI: 10.1093/bioinformatics/btaa450

PMC ID: 7355277

PubMed ID: 32657394

SO-VID: 4a56ffbc-e47b-4a15-b5e0-904e1b67b752

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

Page count

Pages: 8

Funding

Funded by: National Institutes of Health, DOI 10.13039/100000002

Award ID: R01 HG009937

Funded by: NSF, DOI 10.13039/100000001

Award ID: CCF-1750472

Award ID: CNS-1763680
