Segmental duplications and their variation in a complete human genome

Abstract

Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.

INTRODUCTION

Large, high-identity duplicated sequences—termed segmental duplications (SDs)—are frequently the last regions of genomes to be sequenced and assembled. While the human reference genome provided a roadmap of the SD landscape, >50% of the remaining gaps correspond to regions of complex SDs.

RATIONALE

SDs are major sources of evolutionary gene innovations and contribute disproportionately to genetic variation within and between ape species. With the complete human genome (T2T-CHM13), researchers have the potential to identify genes and uncover patterns of human genetic variation.

RESULTS

We identified 51 million base pairs (Mbp) of additional human SD in T2T-CHM13 and now estimate that 7% of the human genome consists of SDs [218 Mbp of 3.1 billion base pairs (Gbp)]. SDs make up two-thirds (45.1 of 68.1 Mbp) of acrocentric short arms, and these SDs are the largest in the human genome (see the figure, panel A). Additionally, 54% of acrocentric SDs are copy number variable or map to different chromosomes among the six individuals examined. A detailed comparison between the current reference genome (GRCh38) and T2T-CHM13 for SD content identifies 81 Mbp of previously unresolved or structurally variable SDs. Short-read whole-genome sequence data from a diversity panel of 268 humans show that human copy number is nine times (59.26 versus 6.55 Mbp) more likely to match T2T-CHM13 rather than GRCh38, including 119 protein-coding genes (see the figure, panel B). Using long-read sequencing data from 25 human haplotypes, we investigated patterns of human genetic variation identifying significant increases in structural and single-nucleotide diversity. We identified gene-rich regions (e.g., TBC1D3) that vary by hundreds of kilobase pairs and gene copy number between individuals showing some of the highest genome-wide structural heterozygosity (85 to 90%). Our analysis identified 182 candidate protein-coding genes as well as the complete sequence for structurally variable gene models that were previously unresolved. Among these is the complete gene structure of lipoprotein A (LPA), including the expanded kringle IV repeat domain. Reduced copies of this domain are among the strongest genetic associations with cardiovascular disease, especially among African Americans, and sequencing of multiple human haplotypes identified not only copy number variation but also other forms of rare coding variation potentially relevant to disease risk. Finally, we compared global methylation and expression patterns between duplicated and unique genes.
Transcriptionally inactive duplicate genes are more likely to map to hypomethylated genomic regions; however, specifically over the transcription start site we observe an increase in methylation, suggesting that as many as two-thirds of duplicated genes are epigenetically silenced. Additionally, SD genes show a high degree of concordance between methylation profiles and transcription levels, allowing us to define the actively transcribed members of high-identity gene families that are otherwise indistinguishable by coding sequence.
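
The proportions quoted in the results above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the authors' analysis: the constant names and the `fraction` helper are illustrative assumptions, and the genome size uses the rounded 3.1 Gbp figure given in the text.

```python
# Figures quoted in the RESULTS text (Mbp = million base pairs).
# Variable names are illustrative, not taken from the paper.
SD_TOTAL_MBP = 218.0       # SD content of T2T-CHM13
GENOME_MBP = 3100.0        # "3.1 billion base pairs (Gbp)", rounded
ACRO_SD_MBP = 45.1         # SD sequence on acrocentric short arms
ACRO_TOTAL_MBP = 68.1      # total acrocentric short-arm sequence
MATCH_T2T_MBP = 59.26      # copy number better matched by T2T-CHM13
MATCH_GRCH38_MBP = 6.55    # copy number better matched by GRCh38


def fraction(part: float, whole: float) -> float:
    """Return part / whole, rejecting a non-positive denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


sd_fraction = fraction(SD_TOTAL_MBP, GENOME_MBP)
acro_fraction = fraction(ACRO_SD_MBP, ACRO_TOTAL_MBP)
t2t_vs_grch38 = fraction(MATCH_T2T_MBP, MATCH_GRCH38_MBP)

print(f"SD share of genome: {sd_fraction:.1%}")
print(f"SD share of acrocentric short arms: {acro_fraction:.1%}")
print(f"T2T-CHM13 vs GRCh38 match ratio: {t2t_vs_grch38:.2f}x")
```

With the rounded inputs, the three values come out at roughly 7.0%, 66.2%, and 9.05, consistent with the "7%", "two-thirds", and "nine times" stated in the text.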

CONCLUSION

A complete human genome provides a more comprehensive understanding of the organization, expression, and regulation of duplicated genes. Our analysis reveals underappreciated patterns of human genetic diversity and suggests characteristic features of methylation and gene regulation. This resource will serve as a critical baseline for improved gene annotation, genotyping, and previously unknown associations for some of the most dynamic regions of our genome.

Author and article information

Contributors

Mitchell R. Vollger

Xavi Guitart

Philip C. Dishuck

Ludovica Mercuri

William T. Harvey

Ariel Gershman

Mark Diekhans

Arvis Sulovari

Katherine M. Munson

Alexandra P. Lewis

Kendra Hoekzema

David Porubsky

Ruiyang Li

Sergey Nurk

Sergey Koren

Karen H. Miga

Adam M. Phillippy

Winston Timp

Mario Ventura

Evan E. Eichler

Journal

Title: Science

Abbreviated Title: Science

Publisher: American Association for the Advancement of Science (AAAS)

ISSN (Print): 0036-8075

ISSN (Electronic): 1095-9203

Publication date Created: April 2022

Publication date (Print): April 2022

Volume: 376

Issue: 6588

Affiliations

[1] Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA.

[2] Department of Biology, University of Bari, Aldo Moro, Bari 70125, Italy.

[3] Department of Molecular Biology and Genetics, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA.

[4] UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, USA.

[5] Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.

[6] Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.

Article

DOI: 10.1126/science.abj6965

PubMed ID: 35357917

SO-VID: 762fabe9-5a53-4d03-9a8d-8454fb42355f
