Despite their importance in disease and evolution, highly identical segmental duplications (SDs) are among the last regions of the human reference genome (GRCh38) to be fully sequenced. Using a complete telomere-to-telomere human genome (T2T-CHM13), we present a comprehensive view of human SD organization. SDs account for nearly one-third of the additional sequence, increasing the genome-wide estimate from 5.4 to 7.0% [218 million base pairs (Mbp)]. An analysis of 268 human genomes shows that 91% of the previously unresolved T2T-CHM13 SD sequence (68.3 Mbp) better represents human copy number variation. Comparing long-read assemblies from human (n = 12) and nonhuman primate (n = 5) genomes, we systematically reconstruct the evolution and structural haplotype diversity of biomedically relevant and duplicated genes. This analysis reveals patterns of structural heterozygosity and evolutionary differences in SD organization between humans and other primates.
Large, high-identity duplicated sequences—termed segmental duplications (SDs)—are frequently the last regions of genomes to be sequenced and assembled. While the human reference genome provided a roadmap of the SD landscape, >50% of the remaining gaps correspond to regions of complex SDs.
SDs are major sources of evolutionary gene innovations and contribute disproportionately to genetic variation within and between ape species. With the complete human genome (T2T-CHM13), researchers have the potential to identify genes and uncover patterns of human genetic variation.
We identified 51 million base pairs (Mbp) of additional human SD in T2T-CHM13 and now estimate that 7% of the human genome consists of SDs [218 Mbp of 3.1 billion base pairs (Gbp)]. SDs make up two-thirds (45.1 of 68.1 Mbp) of acrocentric short arms, and these SDs are the largest in the human genome (see the figure, panel A). Additionally, 54% of acrocentric SDs are copy number variable or map to different chromosomes among the six individuals examined. A detailed comparison between the current reference genome (GRCh38) and T2T-CHM13 for SD content identifies 81 Mbp of previously unresolved or structurally variable SDs. Short-read whole-genome sequence data from a diversity panel of 268 humans show that human copy number is nine times (59.26 versus 6.55 Mbp) more likely to match T2T-CHM13 rather than GRCh38, including 119 protein-coding genes (see the figure, panel B). Using long-read sequencing data from 25 human haplotypes, we investigated patterns of human genetic variation, identifying significant increases in structural and single-nucleotide diversity. We identified gene-rich regions (e.g., TBC1D3) that vary by hundreds of kilobase pairs and in gene copy number between individuals, showing some of the highest genome-wide structural heterozygosity (85 to 90%). Our analysis identified 182 candidate protein-coding genes as well as the complete sequence for structurally variable gene models that were previously unresolved. Among these is the complete gene structure of lipoprotein A (LPA), including the expanded kringle IV repeat domain. Reduced copies of this domain are among the strongest genetic associations with cardiovascular disease, especially among African Americans, and sequencing of multiple human haplotypes identified not only copy number variation but also other forms of rare coding variation potentially relevant to disease risk. Finally, we compared global methylation and expression patterns between duplicated and unique genes.
Transcriptionally inactive duplicate genes are more likely to map to hypomethylated genomic regions; however, specifically over the transcription start site we observe an increase in methylation, suggesting that as many as two-thirds of duplicated genes are epigenetically silenced. Additionally, SD genes show a high degree of concordance between methylation profiles and transcription levels, allowing us to define the actively transcribed members of high-identity gene families that are otherwise indistinguishable by coding sequence.
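The headline proportions quoted in the results above can be checked with simple arithmetic. The following is a minimal sketch in Python that uses only the figures stated in the text; the variable names are illustrative and are not taken from the study's own analysis code.

```python
# Figures quoted in the text, all in millions of base pairs (Mbp).
# Variable names are hypothetical; only the numbers come from the text.
SD_TOTAL_MBP = 218.0        # total SD sequence in T2T-CHM13
GENOME_MBP = 3100.0         # 3.1 Gbp genome size
ACRO_SD_MBP = 45.1          # SD sequence on acrocentric short arms
ACRO_TOTAL_MBP = 68.1       # total acrocentric short-arm sequence
MATCH_T2T_MBP = 59.26       # sequence where copy number better matches T2T-CHM13
MATCH_GRCH38_MBP = 6.55     # sequence where copy number better matches GRCh38

# Share of the genome made up of SDs (reported as 7%).
sd_genome_fraction = SD_TOTAL_MBP / GENOME_MBP

# Share of acrocentric short arms made up of SDs (reported as two-thirds).
acro_sd_fraction = ACRO_SD_MBP / ACRO_TOTAL_MBP

# How many times more sequence matches T2T-CHM13 than GRCh38 (reported as nine times).
t2t_vs_grch38_ratio = MATCH_T2T_MBP / MATCH_GRCH38_MBP

print(f"SD share of genome: {sd_genome_fraction:.1%}")
print(f"SD share of acrocentric short arms: {acro_sd_fraction:.1%}")
print(f"T2T-CHM13 vs GRCh38 match ratio: {t2t_vs_grch38_ratio:.1f}x")
```

The three computed values come out at roughly 7%, two-thirds, and nine-fold, in line with the rounded figures in the text.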
A complete human genome provides a more comprehensive understanding of the organization, expression, and regulation of duplicated genes. Our analysis reveals underappreciated patterns of human genetic diversity and suggests characteristic features of methylation and gene regulation. This resource will serve as a critical baseline for improved gene annotation, genotyping, and previously unknown associations for some of the most dynamic regions of our genome.