A statistical learning method for simultaneous copy number estimation and subclone clustering with single-cell sequencing data

Abstract

The availability of single-cell sequencing (SCS) enables us to assess intra-tumor heterogeneity and identify cellular subclones without the confounding effect of mixed cells. Copy number aberrations (CNAs) have been commonly used to identify subclones in SCS data using various clustering methods, as cells comprising a subpopulation are found to share a genetic profile. However, currently available methods may generate spurious results (e.g., falsely identified variants) in the procedure of CNA detection, thereby diminishing the accuracy of subclone identification within a large, complex cell population. In this study, we developed a subclone clustering method based on a fused lasso model, referred to as FLCNA, which can simultaneously detect CNAs in single-cell DNA sequencing (scDNA-seq) data. Spike-in simulations were conducted to evaluate the clustering and CNA detection performance of FLCNA, benchmarking it against existing copy number estimation methods (SCOPE, HMMcopy) in combination with commonly used clustering methods. Application of FLCNA to a scDNA-seq data set of breast cancer revealed different genomic variation patterns in neoadjuvant chemotherapy-treated samples and pretreated samples. We show that FLCNA is a practical and powerful method for subclone identification and CNA detection with scDNA-seq data.
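
The fused lasso idea named in the abstract can be illustrated with a small sketch. This record does not give FLCNA's actual objective, so the code below evaluates only the generic fused lasso signal-approximation objective: a squared-error fit term, a sparsity penalty, and a fusion penalty on differences between adjacent genomic bins. The function name, the penalty weights, and the log2-ratio values are all hypothetical and chosen purely for illustration.

```python
def fused_lasso_objective(y, beta, lam_sparse, lam_fuse):
    """Generic fused lasso signal-approximation objective.

    Illustrative only: this is NOT FLCNA's exact model, which is not
    described in this record.
    """
    if len(y) != len(beta):
        raise ValueError("y and beta must have the same length")
    # Squared-error fit between observed values and the fitted profile.
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    # Sparsity penalty: pulls fitted values toward 0 (taken here as copy-neutral).
    sparsity = lam_sparse * sum(abs(bi) for bi in beta)
    # Fusion penalty: penalizes jumps between adjacent bins, favoring flat segments.
    fusion = lam_fuse * sum(abs(b1 - b0) for b0, b1 in zip(beta, beta[1:]))
    return fit + sparsity + fusion


# Hypothetical log2 read-depth ratios for 8 consecutive bins in one cell;
# the three middle bins resemble a single-copy gain.
y = [0.1, -0.1, 0.0, 0.6, 0.5, 0.7, 0.1, -0.1]

raw = list(y)                                           # fit every bin exactly
segmented = [0.0, 0.0, 0.0, 0.6, 0.6, 0.6, 0.0, 0.0]    # one flat gained segment

obj_raw = fused_lasso_objective(y, raw, lam_sparse=0.1, lam_fuse=0.5)
obj_seg = fused_lasso_objective(y, segmented, lam_sparse=0.1, lam_fuse=0.5)
print(obj_raw, obj_seg)
```

With these assumed weights, the piecewise-constant profile scores a lower objective than the profile that follows every noisy bin, which is why a fusion penalty tends to produce segment-like copy number estimates rather than bin-by-bin fluctuations.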

Author and article information

Journal

Journal ID (nlm-ta): Genome Res

Journal ID (iso-abbrev): Genome Res

Journal ID (hwp): genome

Journal ID (publisher-id): GENOME

Title: Genome Research

Publisher: Cold Spring Harbor Laboratory Press

ISSN (Print): 1088-9051

ISSN (Electronic): 1549-5469

Publication date (Print): January 2024

Volume: 34

Issue: 1

Pages: 85-93

Affiliations

[1] Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina 29208, USA

[2] Department of Environmental Health Science, Arnold School of Public Health, University of South Carolina, Columbia, South Carolina 29208, USA

[3] Department of Quantitative Sciences, Baylor College of Medicine, Houston, Texas 77030, USA

[4] Department of Biostatistics, College of Public Health and Health Professions and College of Medicine, University of Florida, Gainesville, Florida 32603, USA

Author notes

Corresponding author: feifeixiao@ufl.edu

Author information

Fei Qin http://orcid.org/0000-0003-3678-2879

Christopher I. Amos http://orcid.org/0000-0002-8540-7023

Feifei Xiao http://orcid.org/0000-0002-1597-4719

Article

Medline ID: 9509184

DOI: 10.1101/gr.278098.123

PMC ID: 10903939

PubMed ID: 38290978

SO-VID: 3e61c550-bc05-496a-afa5-4ba5cf6cf9ca

License:

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

History

Date received: 15 May 2023

Date accepted: 8 January 2024

Page count

Pages: 9

Funding

Funded by: National Institutes of Health, doi 10.13039/100000002

Award ID: R21 HG010925
