129
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      Architecture and evolution of a minute plant genome

      research-article
      1 , 2 , 1 , 3 , 1 , 4 , 4 , 4 , 5 , 4 , 6 , 6 , 1 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 12 , 13 , 14 , 15 , 15 , 1 , 1 , 1 , 1 , 16 , 17 , 17 , 1 , 4 , 1
      Nature

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          It has been argued that the evolution of plant genome size is principally unidirectional and increasing owing to the varied action of whole-genome duplications (WGDs) and mobile element proliferation 1 . However, extreme genome size reductions have been reported in the angiosperm family tree. Here we report the sequence of the 82-megabase genome of the carnivorous bladderwort plant Utricularia gibba. Despite its tiny size, the U. gibba genome accommodates a typical number of genes for a plant, with the main difference from other plant genomes arising from a drastic reduction in non-genic DNA. Unexpectedly, we identified at least three rounds of WGD in U. gibba since common ancestry with tomato ( Solanum) and grape ( Vitis). The compressed architecture of the U. gibba genome indicates that a small fraction of intergenic DNA, with few or no active retrotransposons, is sufficient to regulate and integrate all the processes required for the development and reproduction of a complex organism.

          Related collections

          Most cited references19

          • Record: found
          • Abstract: found
          • Article: not found
          Is Open Access

          The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla.

          The analysis of the first plant genomes provided unexpected evidence for genome duplication events in species that had previously been considered as true diploids on the basis of their genetics. These polyploidization events may have had important consequences in plant evolution, in particular for species radiation and adaptation and for the modulation of functional capacities. Here we report a high-quality draft of the genome sequence of grapevine (Vitis vinifera) obtained from a highly homozygous genotype. The draft sequence of the grapevine genome is the fourth one produced so far for flowering plants, the second for a woody species and the first for a fruit crop (cultivated for both fruit and beverage). Grapevine was selected because of its important place in the cultural heritage of humanity beginning during the Neolithic period. Several large expansions of gene families with roles in aromatic features are observed. The grapevine genome has not undergone recent genome duplication, thus enabling the discovery of ancestral traits and features of the genetic organization of flowering plants. This analysis reveals the contribution of three ancestral genomes to the grapevine haploid content. This ancestral arrangement is common to many dicotyledonous plants but is absent from the genome of rice, which is a monocotyledon. Furthermore, we explain the chronology of previously described whole-genome duplication events in the evolution of flowering plants.
            Bookmark
            • Record: found
            • Abstract: found
            • Article: found
            Is Open Access

            The tomato genome sequence provides insights into fleshy fruit evolution

            Introductory Paragraph Tomato (Solanum lycopersicum) is a major crop plant and a model system for fruit development. Solanum is one of the largest angiosperm genera 1 and includes annual and perennial plants from diverse habitats. We present a high quality genome sequence of domesticated tomato, a draft sequence of its closest wild relative, S. pimpinellifolium 2 , and compare them to each other and to potato (S. tuberosum). The two tomato genomes show only 0.6% nucleotide divergence and signs of recent admixture, but show >8% divergence from potato, with nine large and several smaller inversions. In contrast to Arabidopsis, but similar to soybean, tomato and potato, small RNAs map predominantly to gene-rich chromosomal regions, including gene promoters. The Solanum lineage has experienced two consecutive genome triplications: one that is ancient and shared with rosids, and a more recent one. These triplications set the stage for the neofunctionalization of genes controlling fruit characteristics, such as colour and fleshiness. Main Text The genome of the inbred tomato cultivar ‘Heinz 1706’ was sequenced and assembled using a combination of Sanger and “next generation” technologies (Supplementary Section 1). The predicted genome size is ~900 Mb, consistent with prior estimates 3 , of which 760 Mb were assembled in 91 scaffolds aligned to the 12 tomato chromosomes, with most gaps restricted to pericentromeric regions (Fig. 1A; Supplementary Fig. 1). Base accuracy is approximately one substitution error per 29.4 kb and one indel error per 6.4 kb. The scaffolds were linked with two BAC-based physical maps and anchored/oriented using a high-density genetic map, introgression line mapping and BAC fluorescence in situ hybridisation (FISH). The genome of S. pimpinellifolium (accession LA1589) was sequenced and assembled de novo using Illumina short reads, yielding a 739 Mb draft genome (Supplementary Section 3). Estimated divergence between the wild and domesticated genomes is 0.6% (5.4M SNPs distributed along the chromosomes (Fig. 1A, Supplementary Fig. 1)). Tomato chromosomes consist of pericentric heterochromatin and distal euchromatin, with repeats concentrated within and around centromeres, in chromomeres and at telomeres (Fig. 1A, Supplementary Fig. 1). Substantially higher densities of recombination, genes and transcripts are observed in euchromatin, while chloroplast insertions (Supplementary Sections 1.22-1.23) and conserved miRNA genes (Supplementary Section 2.9) are more evenly distributed throughout the genome. The genome is highly syntenic with those of other economically important Solanaceae (Fig. 1B). Compared to the genomes of Arabidopsis 4 and sorghum 5 , tomato has fewer high-copy, full-length LTR retrotransposons with older average insertion ages (2.8 versus 0.8 mya) and fewer high-frequency k-mers (Supplementary Section 2.10). This supports previous findings that the tomato genome is unusual among angiosperms by being largely comprised of low-copy DNA 6,7 . The pipeline used to annotate the tomato and potato 8 genomes is described in Supplementary Section 2. It predicted 34,727 and 35,004 protein-coding genes, respectively. Of these, 30,855 and 32,988, respectively, are supported by RNA-Seq data, and 31,741 and 32,056, respectively, show high similarity to Arabidopsis genes (Supplementary section 2.1). Chromosomal organisation of genes, transcripts, repeats and sRNAs is very similar in the two species (Supplementary Figures 2-4). The protein coding genes of tomato, potato, Arabidopsis, rice and grape were clustered into 23,208 gene groups (≥2 members), of which 8,615 are common to all five genomes, 1,727 are confined to eudicots (tomato, potato, grape and Arabidopsis), and 727 are confined to plants with fleshy fruits (tomato, potato and grape) (Supplementary Section 5.1, Supplementary Fig. 5). Relative expression of all tomato genes was determined by replicated strand-specific Illumina RNA-Seq of root, leaf, flower (2 stages) and fruit (6 stages) in addition to leaf and fruit (3 stages) of S. pimpinellifolium (Supplementary Table 1). sRNA sequencing data supported the prediction of 96 conserved miRNA genes in tomato and 120 in potato, a number consistent with other plant species (Fig. 1A, Supplementary Figures 1 and 3, Supplementary Section 2.9). Among the 34 miRNA families identified, 10 are highly conserved in plants and similarly represented in the two species, whereas other, less conserved families are more abundant in potato. Several miRNAs, predicted to target TIR-NBS-LRR genes, appeared to be preferentially or exclusively expressed in potato (Supplementary Section 2.9). Supplementary section 4 deals with comparative genomic studies. Sequence alignment of 71 Mb of euchromatic tomato genomic DNA to their potato 8 counterparts revealed 8.7% nucleotide divergence (Supplementary Section 4.1). Intergenic and repeat-rich heterochromatic sequences showed more than 30% nucleotide divergence, consistent with the high sequence diversity in these regions among potato genotypes 8 . Alignment of tomato-potato orthologous regions confirmed 9 large inversions known from cytological or genetic studies and several smaller ones (Fig. 1C). The exact number of small inversions is difficult to determine due to the lack of orientation of most potato scaffolds. 18,320 clearly orthologous tomato-potato gene pairs were identified. Of these, 138 (0.75%) had significantly higher than average non-synonymous (Ka) versus synonymous (Ks) nucleotide substitution rate ratios (ω), suggesting diversifying selection, whereas 147 (0.80%) had significantly lower than average ω, suggesting purifying selection (Supplementary Table 2). The proportions of high and low ω between sorghum and maize (Zea mays) are 0.70% and 1.19%, respectively, after 11.9 Myr of divergence 9 , suggesting that diversifying selection may have been stronger in tomato-potato. The highest densities of low-ω genes are found in collinear blocks with average Ks >1.5, tracing to a genome triplication shared with grape (see below) (Fig. 1C, Supplementary Fig. 6, Supplementary Table 3). These genes, which have been preserved in paleo-duplicated locations for more than 100 Myr 10,11 are more constrained than ‘average’ genes and are enriched for transcription factors and genes otherwise related to gene regulation (Supplementary Tables 3-4). Sequence comparison of 32,955 annotated genes in tomato and S. pimpinellifolium revealed 6,659 identical genes and 3,730 with only synonymous changes. A total of 22,888 genes had non-synonymous changes, including gains and losses of stop codons with potential consequences for gene function (Supplementary Tables 5-7). Several pericentric regions, predicted to contain genes, are absent or polymorphic in the broader S. pimpinellifolium germplasm (Supplementary Table 8, Supplementary Fig. 7). Within cultivated germplasm, particularly among the small-fruited cherry tomatoes, several chromosomal segments are more closely related to S. pimpinellifolium than to ‘Heinz 1706’ (Supplementary Figures 8-9), supporting previous observations on recent admixture of these gene pools due to breeding 12 . ‘Heinz 1706’ itself has been reported to carry introgressions from S. pimpinellifolium 13 , traces of which are detectable on chromosomes 4, 9, 11 and 12 (Supplementary Table 9). Comparison of the tomato and grape genomes supports the hypothesis that a whole-genome triplication affecting the rosid lineage occurred in a common eudicot ancestor 11 (Fig. 2B). The distribution of Ks between corresponding gene pairs in duplicated blocks suggests that one polyploidisation in the solanaceous lineage preceded the rosid-asterid (tomato-grape) divergence (Supplementary Fig. 10). Comparison to the grape genome also reveals a more recent triplication in tomato and potato. While few individual tomato/potato genes remain triplicated (Supplementary Tables 10-11), 73% of tomato gene models are in blocks that are orthologous to one grape region, collectively covering 84% of the grape gene space. Among these grape genomic regions, 22.5% have one orthologous region in tomato, 39.9% have two, and 21.6% have three, indicating that a whole genome triplication occurred in the Solanum lineage, followed by widespread gene loss. This triplication, also evident in potato (Supplementary Fig. 11) is estimated at 71 (+/-19.4) mya based on Ks of paralogous genes (Supplementary Fig. 10), and therefore predates the ~7.3 mya tomato-potato divergence. Based on alignments to single grape genome segments, the tomato genome can be partitioned into three non-overlapping ‘subgenomes’ (Fig. 2A). The number of euasterid lineages that have experienced the recent triplication remains unclear and awaits complete euasterid I and II genome sequences. Ks distributions show that euasterids I and II, and indeed the rosid-asterid lineages, all diverged from common ancestry at or near the pan-eudicot triplication (Fig. 2B), suggesting that this event may have contributed to formation of major eudicot lineages in a short period of several million years 14 , partially explaining the explosive radiation of angiosperm plants on earth 15 . Supplementary section 5 reports on the analysis of specific gene families. Fleshy fruits (Supplementary Fig. 12) are an important means of attracting vertebrate frugivores for seed dispersal 16 . Combined orthology and synteny analyses suggest that both genome triplications added new gene family members that mediate important fruit-specific functions (Fig. 3). These include transcription factors and enzymes necessary for ethylene biosynthesis (RIN, CNR, ACS) and perception (LeETR3/NR, LeETR4) 17 , red light photoreceptors influencing fruit quality (PHYB1/PHYB2) and ethylene- and light-regulated genes mediating lycopene biosynthesis (PSY1/PSY2). Several cytochrome P450 subfamilies associated with toxic alkaloid biosynthesis show contraction or complete loss in tomato and the extant genes show negligible expression in ripe fruits (Supplementary Section 5.4). Fruit texture has profound agronomic and sensory importance and is controlled in part by cell wall structure and composition 18 . More than 50 genes showing differential expression during fruit development and ripening encode proteins involved in modification of wall architecture (Fig. 4A and Supplementary Section 5.7). For example, a family of xyloglucan endotransglucosylase-/hydrolases (XTHs) has expanded both in the recent whole genome triplication and through tandem duplication. One of the triplicated members, SlXTH10, shows differential loss between tomato and potato (Fig. 4A, Supplementary Table 12), suggesting genetically driven specialisation in the remodelling of fruit cell walls. Similar to soybean and potato and in contrast to Arabidopsis, tomato sRNAs map preferentially to euchromatin (Supplementary Fig. 2). sRNAs from tomato flowers and fruits 19 map to 8,416 gene promoters. Differential expression of sRNAs during fruit development is apparent for 2,687 promoters, including those of cell wall-related genes (Fig. 4B) and occurs preferentially at key developmental transitions (e.g. flower to fruit, fruit growth to fruit ripening, Supplementary Section 2.8). The genome sequences of tomato, S. pimpinellifolium and potato provide a starting point for comparing gene family evolution and sub-functionalization in the Solanaceae. A striking example is the SELF PRUNING (SP) gene family, which includes the homolog of Arabidopsis FT, encoding the mobile flowering hormone florigen 20 and its antagonist SP, encoding the ortholog of TFL1. Nearly a century ago, a spontaneous mutation in SP spawned the “determinate” varieties that now dominate the tomato mechanical harvesting industry 21 . The genome sequence has revealed that the SP family has expanded in the Solanum lineage compared to Arabidopsis, driven by the Solanum triplication and tandem duplication (Supplementary Fig. 13). In potato, SP3D and SP6A control flowering and tuberisation, respectively 22 , whereas SP3D in tomato, known as SINGLE FLOWER TRUSS, similarly controls flowering, but also drives heterosis for fruit yield in an epistatic relationship with SP 23,24,25 . Interestingly, SP6A in S. lycopersicum is inactivated by a premature stop codon, but remains functionally intact in S. pimpinellifolium. Thus, allelic variation in a subset of SP family genes has played a major role in the generation of both shared and species-specific variation in Solanaceous agricultural traits. The genome sequences of tomato and S. pimpinellifolium also provide a basis for understanding the bottlenecks that have narrowed tomato genetic diversity: the domestication of S. pimpinellifolium in the Americas, the export of a small number of accessions to Europe in the 16th Century, and the intensive breeding that followed. Charles Rick pioneered the use of trait introgression from wild tomato relatives to increase genetic diversity of cultivated tomatoes 26 . Introgression lines exist for seven wild tomato species, including S. pimpinellifolium, in the background of cultivated tomato. The genome sequences presented here and the availability of millions of SNPs will allow breeders to revisit this rich trait reservoir and identify domestication genes, providing biological knowledge and empowering biodiversity-based breeding. Methods Summary A total of 21 Gb of Roche/454 Titanium shotgun and matepair reads and 3.3 Gb of Sanger paired-end reads, including ~200,000 BAC and fosmid end sequence pairs, were generated from the ‘Heinz 1706’ inbred line (Supplementary Sections 1.1-1.7), assembled using both Newbler and CABOG and integrated into a single assembly (Supplementary Sections 1.17-1.18). The scaffolds were anchored using two BAC-based physical maps, one high density genetic map, overgo hybridization and genome-wide BAC FISH (Supplementary Sections 1.8-1.16 and 1.19). Over 99.9% of BAC/fosmid end pairs mapped consistently on the assembly and over 98% of EST sequences could be aligned to the assembly (Supplementary Section 1.20). Chloroplast genome insertions in the nuclear genome were validated using a matepair method and the flanking regions were identified (Supplementary Sections 1.22-1.24). Annotation was carried out using a pipeline based on EuGene that integrates de novo gene prediction, RNA-Seq alignment and rich function annotation (Supplementary Section 2). To facilitate interspecies comparison, the potato genome was re-annotated using the same pipeline. LTR retrotransposons were detected de novo with the LTR-STRUC program and dated by the sequence divergence between left and right solo LTR (Supplementary Section 2.10). The genome of S. pimpinellifolium was sequenced to 40x depth using Illumina paired end reads and assembled using ABySS (Supplementary Section 3). The tomato and potato genomes were aligned using LASTZ (Supplementary Section 4.1). Identification of triplicated regions was done using BLASTP, in-house generated scripts and three way comparisons between tomato, potato and S. pimpinellifolium using MCscan (Supplementary Sections 4.2-4.4). Specific gene families/groups (genes for ascorbate, carotenoid and jasmonate biosynthesis, cytochrome P450s, genes controlling cell wall architecture, hormonal and transcriptional regulators, resistance genes) were subjected to expert curation/analysis, (Supplementary Section 5). PHYML and MEGA were used to reconstruct phylogenetic trees and MCSCAN was used to infer gene collinearity (Supplementary Section 5.2). Supplementary Material 1 2 3 4
              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              The Pattern of Polymorphism in Arabidopsis thaliana

              Introduction The field of population genetics has always been heavily influenced by mathematical models. Ever since molecular polymorphism data started to become available, in the form of allozymes [1] or DNA sequences [2], population geneticists have searched for footprints of selection by comparing the patterns of polymorphism in particular genes with the pattern expected under standard neutral models [3,4]. Considerable intellectual effort has gone into estimating model parameters such as the mutation rate θ, the recombination rate ρ, and the effective population size, Ne [3,5]. However, because of the limited availability of data, it has been difficult to determine whether the underlying models are appropriate. For example, demographic factors such as population structure and growth can cause the genome-wide pattern of variation to deviate from standard neutral models in ways that mimic selection [4,6]. Thus, without knowing whether a standard neutral model describes the pattern of variation in most of the genome, it is difficult to conclude that a particular gene has been under selection. With the advent of high-throughput genotyping and sequencing, sufficient data for the critical appraisal of standard models are starting to become available, especially in humans [7–9]. Here we report our findings from a systematic survey of genomic DNA sequence polymorphism in Arabidopsis thaliana, one of the first in any organism. Our goal was to investigate the pattern of polymorphism in a large sample of individuals, using sufficiently densely spaced loci to obtain insight into the genome-wide haplotype structure of the species. The scale of our study allows us to describe the pattern of polymorphism with unprecedented accuracy. We begin by describing how variation is distributed, with respect to space (i.e., population structure) as well as with respect to haplotypes (i.e., linkage disequilibrium [LD]). Our set of 96 individuals contained hierarchical population samples in addition to a worldwide collection of stock center accessions (Tables S1 and S2): because of this and the large number of polymorphisms, we are able to answer a number of questions about population structure that previous studies have not been able to address. In the second part of the paper, we compare the pattern of variation to predictions made by standard population genetics models. The number of loci sequenced is sufficient to investigate the distribution of important summary statistics across the genome rather than simply looking at averages (as is usually done). Results/Discussion Sequencing A total of 876 reliable alignments, representing 0.48 Mbp of the genome (or a total over all individuals of ~44 Mbp) was generated. The average sequence length is 583 bp; the average sample size across alignments is 89. Based on the A. thaliana genome annotation, the composition of our data, which includes more than 17,000 single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms, is 15% intergenic, 55% exon, 22% intron, 4% UTR, and 5% pseudogene (see Materials and Methods). The majority of fragments, 67%, contain both coding and noncoding sequence. Population Structure Overall levels of polymorphism Our estimates of the level of polymorphism are broadly comparable to what has previously been found in A. thaliana and other species, both in terms of overall levels of polymorphism, and in the degree of constraint on different kinds of sites (Figure 1; cf. [3]). The highly selfing A. thaliana does not have unusually low levels of polymorphism: the observed values are somewhat lower than for Drosophila melanogaster, and are considerably higher than for humans. A standard way of summarizing the geographical distribution of this variation is through the statistic FST, which, loosely speaking, measures the fraction of the observed genetic variation that is due to population structure [10]. Our sample contains 40 individuals that were hierarchically sampled in pairs from four populations in each of five regions (Table S1). A hierarchical analysis of variation in these individuals reveals that 33% of the global variation is segregating among individuals within populations, 35% is segregating among local populations within regions, and 26% is segregating among regions. Only 6% of the variation in the global sample is not captured by these 40 individuals. Even though only two individuals were sampled per population, and even though our estimates of within-population variance are upwardly biased (individuals were prescreened to avoid sequencing identical individuals; see Materials and Methods), our data clearly show that individual populations harbor much of the variation present species-wide. At the same time, there is strong population structure. Global geographic structure Studies of variation in A. thaliana have typically not found any correlation between the genotype and geographic origin of accessions. This has been attributed to a recent expansion of the species, perhaps in combination with human disturbance. However, early studies had little power to detect population structure, and a more recent, larger survey revealed weak isolation by distance [11]. Our study has several orders of magnitude more markers than any previous study known to us, and we find clear global population structure. We used a model-based clustering algorithm, implemented in the computer program Structure, to cluster our accessions on the basis of genotype [12]. Loosely speaking, the algorithm attempts to identify a predetermined number of clusters, K, that have distinctive allele frequencies, and assigns portions of individual genomes to these clusters. A genome can have membership in several clusters, and the algorithm reports the probability distribution of the assignment of each section of the genome. We analyzed the data by successively increasing K from two to eight (Figure 2). For K = 2, we see an East–West gradient, potentially attributable to post-glaciation colonization routes [11]. When we increaseK to three, all accessions from northern Sweden and Finland are assigned to a single cluster together with (to varying degrees) accessions from Eastern Europe, Russia, and Central Asia. While a relationship between northern Sweden and Tajikistan may seem far-fetched, several species are known to have colonized the Scandinavian Peninsula from both north (from Russia via Finland) and south (from Europe via Denmark) after the last glaciation [13]. As we increase K from three to eight, each new cluster splits a previously existing one along plausible geographical boundaries (and identifies some accessions as mixed). Thus K = 4 separates central European (Czech, Austrian, and Croatian) accessions from the main European cluster, K = 5 separates a subset of the United States accessions from the rest of Western Europe, K = 6 separates the Central Asian and Russian accessions from northern Sweden and Finland, K = 7 separates many German and southern Swedish accessions from the rest of Europe, and K = 8 identifies a Catalan cluster. Clusters identified for K > 8 did not contain the majority of any single individual genome. When these results are superimposed on a map (Figure 3), the pattern of isolation by distance becomes obvious. Individuals are, by and large, more similar to individuals that grow nearby than to individuals from far away. Although A. thaliana commonly occurs as a weed and human commensal, this has not been sufficient to erase population structure. Population structure is also evident in the distribution of pairwise differences between individuals. In the absence of population structure, all individuals should be equally closely related on average. As shown in Figure 4, this is clearly not the case. Not only is there generally a wider range of variation than would be expected in the absence of population structure, but there are also clear outliers. Some individuals are extremely closely related: these are typically from the same local population (and will be further discussed below). Perhaps more surprising is that two stock center accessions, Cvi-0 (Cape Verde Islands) (cf. [14]) and Mr-0 (Italy), are very different from all others (and each other). The distribution of pairwise differences can be conveniently summarized using hierarchical clustering. The population structure revealed by the resulting tree (Figure S1) generally agrees with the output of Structure. Variation within and between populations Because it is highly selfing, A. thaliana has often been considered a collection of asexual lineages, or “ecotypes.” This view is completely false. Indeed, even the first sequencing survey of variation showed clear evidence of recombination between accessions [15], and a subsequent study has shown that recombination has generally been sufficient to erode genome-wide LD on a very fine scale [16]. Consequently, there is no “phylogeny” of ecotypes. However, this still leaves open the possibility that local population structure is tree-like, with individuals within the same population being much more closely related to each other than to individuals from other populations. Indeed, since A. thaliana is a selfer, it is perfectly possible for a local population to consist of a single inbred sibship. We find that this is typically not the case; most sampled populations were polymorphic (even though this part of our study had relatively little power; see Materials and Methods), and when we consider the genome-wide data, the pattern that emerges is far from tree-like. As shown in Figure 2C and 2H, only a small fraction of all sequenced fragments have patterns of polymorphism compatible with monophyly in the sense that, with respect to that fragment, the members of a particular cluster are all more closely related to each other than to members of other clusters (the yellow “US” cluster is an exception to which we return below). Instead, just as is the case for human populations [17], most polymorphisms are shared between clusters, and levels of polymorphism are in general comparable within all clusters (see Figure 2E; for clarity, only values for K = 3 are shown). The same is true with respect to the local populations, although there is great variation between regions. An interesting way to describe the relationship between individuals is to try to identify chromosomal regions shared identical by descent by looking for long identical haplotypes [18–22]. The resulting patterns clearly reveal the difference in structure between geographic regions (Figure 5). In northern Sweden, pairs of accessions sampled from the same population are invariably more closely related to each other than to accessions from other populations; however, even here populations are far from monophyletic in the sense used above (i.e., for a particular locus, the closest relative may well come from a different population). In most regions (exemplified in Figure 5 by central Europe), haplotype sharing is moderate, and is not much greater within than between populations. The US sample is again different: here, pairs of accessions often appear to share entire chromosomes, and are equally likely to do so between populations. Regional variation in the level of population structure is also evident in the distribution of pairwise differences (see Figures 4 and S1). Individuals that are extremely closely related (less than ten differences) are almost always pairs from the same local population (two pairs from northern Sweden, one pair from Finland, and one from Germany). The one exception is a trio of nearly identical individuals from different US populations. Figure 2D and 2I illustrate the variation in FST across the genome for K = 3 and K = 8. This pattern is of interest in that regions with extremely high or low values may be seen as candidates for harboring selectively important loci [23,24]. Structure summarized The picture that emerges is that of a single, large, globally distributed population with historical gene flow sufficient to ensure that variation is shared worldwide, yet limited enough to cause considerable population structure. Genetic exchange is not only geographic; there has been enough outcrossing to ensure that LD decays within 25–50 kb on average (Figure 6), which is comparable to what has been observed in humans. All of this may seem surprising given the highly selfing nature of A. thaliana, but it is in fact completely compatible with theoretical predictions [16,25]. The only exception to this pattern comes from the US. Our sample from the US Midwest is clearly a heterogeneous collection, characterized by genome-wide LD and haplotype sharing. Especially notable is extensive haplotype sharing with accessions from other regions, in particular with the United Kingdom. All this strongly suggests that A. thaliana is a recent human introduction to the New World, and that the introduction severely reduced haplotype variation through bottleneck effects, causing genome-wide LD [16]. Since we have population samples from only the Midwest, we cannot rule out the possibility that the pattern is different in other parts of the US; however, we note that our one non-Midwestern US accession, Yo-0 (from Yosemite, California), is almost identical to some of our Midwestern accessions, and that a recent survey of variation in 53 US populations found no evidence of differentiation across the continent [26]. It should be emphasized that we view both Structure and hierarchical clustering as tools for exploring the data. The results should not be taken literally. For example, we do not believe that there are K = 8 random-mating populations in A. thaliana, as might be suggested by Figures 2 and 3, nor do we believe that populations are related in a tree-like manner, as might be suggested by Figure S1. Global Patterns Allele frequency distribution Note that θ̂S is consistently higher than θ̂P (see Figure 1). This is typically caused by an unusually high ratio of rare to common alleles, compared to what is expected under standard population genetics models [27]. A closer look at the distribution of allele frequencies reveals that this is indeed the case (Figure 7A). The observed distribution is skewed toward rare alleles compared to standard neutral expectations. A possible explanation for this is recent population growth [28]; however, the skew is much greater for nonsynonymous than for synonymous polymorphisms, suggesting that selective factors must also be involved. The effect of this genome-wide deviation from standard neutral models on “tests of selection” can be dramatic. These tests typically assume that the standard neutral model describes most of the genome, and interpret deviations at particular loci as signs of selection [3]. Our results show clearly that this procedure is not appropriate for A. thaliana (see also [29]). It is, of course, not a new finding that demographic history can invalidate tests of selection, but our results provide a striking illustration of the potential seriousness of the problem. For example, the mean value of one popular statistic, Tajima's D [27], is −0.8 rather than the (approximately) zero expected under simple neutral models, and the variance is also larger than predicted (Figure 7B). Positive values of Tajima's D are typically attributed to balancing selection: 2% of our fragments are significantly positive at the 1% level. Negative values are typically attributed to directional selection: 15% of our fragments are significantly negative at the 1% level. Although some of these deviations may, of course, actually be due to selection, our data suggest that tests based on standard cutoff values are anticonservative in both tails of the distribution. Consistent with this interpretation, a much higher fraction of studied genes have been reported to be under selection in A. thaliana than in other species [30]. Variation in the level of polymorphism While it is straightforward to fit a neutral model with growth to the observed distribution of Tajima's D (Figure 7B), the value of this exercise is doubtful. First, it is clear from Figure 7A that selection must be part of the reason for the skewed allele frequency distribution. Second, a model with growth would, in fact, fit other aspects of the data less well. In particular, population growth tends to reduce the variability in coalescence times across the genome compared to models with constant population size, resulting in less variation in the level of polymorphism between loci. We see the opposite: the variance between loci is considerably greater than expected under a standard neutral model with constant population size (Figure 7C). In addition, the distribution is heavily skewed and displays a long tail of extremely high values. There are several reasons to expect a poor fit to a simple neutral model. One is variation across the genome in the “neutral” mutation rate, θ, either due to variation in the level of selective constraint or due to variation in the actual, underlying mutation rate. Since the excess variability is observed equally for coding and noncoding DNA, we would have to invoke the latter. The extent to which variation in the mutation rate contributes to the pattern in Figure 7C can be estimated as soon as divergence data from a closely related outgroup species become available. Another factor likely to contribute to the variation in the level of polymorphism is population structure such as that described in the first part of the paper. It is well known that population structure can inflate the variance of coalescence times as well as induce a tail of very large values that would result in patterns of variability such as the one observed [6]. However, strong population structure is generally expected to push the distribution of Tajima's D toward positive values, which is the opposite of what we observe. It is possible that some model that involves growth (perhaps in conjunction with bottlenecks), structure, and finite sampling [7] could explain the pattern observed in A. thaliana, but we have not been able to find such a model (see also [29]). Genomic patterns of polymorphism It turns out that the pattern of polymorphism is affected by not only demographic forces, but also factors intrinsic to the genome. Figure 7D shows that polymorphism in noncoding regions is negatively correlated with local gene density. This mirrors the positive correlation between polymorphism and local recombination rates that was first noted in Drosophila [31] and has since been observed in a wide range of organisms [32]. A possible explanation is that recombination itself is mutagenic: this appears to explain the correlation observed in humans, but is not sufficient to explain the phenomenon in general [32,33]. Instead, it has been proposed that variation is reduced in low-recombination regions because of a “hitchhiking effect” due to selection on linked sites [34,35], in the form of either positive selection (“selective sweeps”) [36] or purifying selection (“background selection”) [37]. Such hitchhiking effects would be stronger in low-recombination regions because sites in these regions are affected by selection on larger pieces of the chromosome (i.e., more genes). Recombination is thus used as a proxy for gene density: the real prediction of these models is that polymorphism should decrease with gene density. This is precisely what we observe. The level of polymorphism is insignificantly positively correlated with recombination (suggesting that although recombination may well be mutagenic, this does not explain the phenomenon), but is strongly negatively correlated with gene density. Two factors suggest that background selection rather than selective sweeps is responsible for the correlation. First, unlike background selection, selective sweeps are expected to skew Tajima's D toward negative values. However, we find no correlation between Tajima's D and gene density. Second, it is clear from Figure 7A that a significant load of deleterious mutations (as is required by background selection) exists in A. thaliana. Figure 7E reveals that polymorphism is also positively correlated with segmental duplication, similar to what has been observed in humans [38–40]. In humans, the phenomenon appears to be largely due to misclassification of paralogous sequence variants as SNPs. Distinguishing between paralogous sequence variants and true SNPs is difficult for human data, where putative SNPs are typically detected in small samples of highly heterozygous individuals. In contrast, our data consist of high-quality sequences from a large sample of almost completely homozygous individuals, and we are therefore confident that nearly all of our polymorphisms are genuine (see Materials and Methods); fragments that have a close match elsewhere in the genome thus appear to be more variable than fragments that do not. We hypothesize that this is caused by a low level of intergenic gene conversion that serves to “shuffle” variation between loci. Such gene conversion has long been known to occur in large multigene families (“concerted evolution”; [41]): our results suggest that it may be a general phenomenon. Recombination and gene conversion We noted above (see Figure 6) that LD decays within 25–50 kb, somewhat faster than has previously been suggested [16]. At least 25% of our sequenced fragments show evidence of recombination (using the four-gamete test; [42]). Estimates based on coalescent models suggest an effective population recombination rate (e.g., [6]) of approximately ρ = 2 × 10−4 per basepair (V. P., B. P., P. Marjoram, J. W., and M. N., unpublished data). Given our estimates of the mutation rate θ (see Figure 1), this implies a ratio θ/ρ of about 20. The short-range pattern of LD in several species is incompatible with the long-range pattern; there is too little of the former relative to the latter for a simple recombination model that includes only crossing over to explain the data [43–48]. Possible explanations include gene conversion and multiple mutations (i.e., each SNP not being due to a unique mutation event), both of which will erode short-range LD [46]. There is clear evidence for both phenomena in our data. We observed a total of 315 tri-allelic SNPs. Since less than 50% of all multiple-hit mutations will result in more than two distinct alleles, this suggests that a total of more than 600 of our SNPs are, in fact, the product of multiple mutations. We also observed three fragments that show clear evidence for gene conversion in that a single gene conversion event (i.e., a double cross-over within 500 bp, as would result from the resolution of a single Holiday junction) suffices to explain a complicated pattern of polymorphisms based on multiple SNPs in the fragment. Coalescent-based analyses based on the fine-scale pattern of polymorphism suggest that gene conversion is about five times more common than crossing over, in agreement with previous population genetic analyses [48], as well as with direct estimates based on tetrad analysis [49]. Concluding Remarks We have shown that the pattern of polymorphism in A. thaliana, a selfing human commensal, generally agrees with what would be expected for a widely distributed sexually reproducing species. Although there is significant population structure, polymorphism is shared worldwide. As predicted by population genetics theory [25], the only clear indications of selfing in the pattern of polymorphism are that individuals are typically homozygous, and that LD is unusually extensive. The scale of our study allows us to consider the genomic distribution of statistics commonly used to summarize polymorphism data. We find that these distributions generally deviate significantly from what is assumed by standard population genetics models. This highlights the danger of using highly parameterized models based on untested assumptions for inference in population genetics. Commonly used “tests of selection” are simply not valid in A. thaliana (cf. [29,30]). Large-scale analyses in other organisms have similarly found genome-wide deviations from standard models (e.g., [43,50,51]). As data continue to accumulate, the focus of population geneticists will surely have to shift from rejecting null models that do not fit particular loci to finding models that actually do explain the bulk of the data. Genomic polymorphism data are required to develop more robust inference methods, and will enable us to study phenomena that are intrinsically genomic (e.g., the correlations in Figure 7D and 7E). More importantly, however, these data will help identify the functional polymorphisms that underlie phenotypic variation. The pattern of polymorphism in A. thaliana, characterized by humanlike levels of LD but much higher SNP density, coupled with the availability of naturally occurring inbred lines, makes the species ideal for LD mapping. Although the strong population structure is likely to cause a high rate of spurious genotype–phenotype associations, these problems can easily be overcome through direct experimental verification using crosses or transgenics. This significantly strengthens the position of A. thaliana as a model for evolutionary functional genomics. Materials and Methods Sampling The sample of 96 individuals included pairs of individuals from 25 local “populations” (typically sampled within a few hundred meters of each other, often much closer) as well as a worldwide survey of commonly used stock center accessions (Tables S1 and S2). Where possible, four populations were sampled from each of several regions. The sample was generated by screening a larger set of accessions with a small number of markers to avoid inbred siblings or extensively heterozygous individuals (E. B., E. Stahl, C. T., M. N., M. K., and J. B., unpublished data). Accessions were genotyped using 11 unlinked markers (five microsatellites, two indel R-genes, and four housekeeping genes with previously identified polymorphisms). To ensure that individuals sampled from local populations were not part of inbred sibships, four (three in one case) individuals from each of 37 populations were tested. Polymorphism was found in 25 of these populations, and a pair of nonidentical individuals was selected at random from each (Table S1). Some accessions not from the same population were also found to be identical with respect to these markers (Col-0 and Lp2-2; Ts-1 and Shahdara), but these were included nonetheless. Five accessions were found to be heterozygous and were eliminated. Four of these were from the population samples, and one, Ms-0, was from the stock center. Further testing of two additional Ms-0 lines revealed one more heterozygote and one homozygote, which was included. In spite of these precautions, one sequenced stock center accession, Van-0, turned out to be extensively heterozygous and was eliminated from the analyses in this paper (bringing the sample size to 95). Data generation We used direct, PCR-based sequencing of genomic DNA, with primers designed from the A. thaliana reference sequence to cover the genome relatively uniformly. To achieve uniform density of our fragments, the reference genome (releases January 7, 2002, and April 17, 2003) was first divided into equally spaced regions. The last 10 kb of each region then served as an input record to Primer3 (v. 0.6). The designed primer pairs returned from Primer3 for each region were then screened for uniqueness and quality. To screen for uniqueness, all primer pairs were BLASTed (BLAST v. 2.2.3) against both the reference genome as well as BAC datasets downloaded from the Arabidopsis Information Resource (http://www.arabidopsis.org/). Any primer pair that produced a hit in the same region (≤2,300 bps) was removed. Self-amplifying primers were also removed based on this same criterion. Additionally, primers with more than five BLAST hits against the reference were also discarded. To improve the quality of each fragment, any primer pair that amplified a target sequence that contained a homonucleotide run of nine bases or more was removed. All sequencing was done using ABI 3700 automated sequencers (Applied Biosystems, Foster City, California, United States). All fragments were sequenced in both directions. Chromatograms were initially base-called with Phred (v. 0.020425.c) and trimmed based on quality value. The start and end of each read was trimmed until the average quality value was 25 in a window of ten bases, and internal bases were converted to missing data when their quality value was below ten. Accessions missing one read of data were trimmed more severely (different setting were used). A combination of Phrap (v. 0.020425.c) and ClustalW (v. 1.82) was used for producing alignments using a modified weight matrix that allowed us to incorporate quality values into the ClustalW algorithm. Alignments were then visually inspected and adjusted as necessary using Consed (v. 13.0). Polyphred (v. 4.20) was used to flag potential heterozygotes, which were confirmed by visual inspection of chromatograms. Additional trimming was performed as necessary for accessions with multiple false polymorphisms and low-quality sequence after a visual inspection of chromatograms. Whenever two reads from the same accession disagreed, the final call was made by visually inspecting chromatograms unless the difference in quality value made the final call obvious. Potential polymorphisms in each alignment were then verified by a second person. All alleles found only once or twice in the sample were verified by visually inspecting the chromatograms. When this inspection did not reveal a chromatogram peak clearly different from the other accessions, the base was changed to missing data. This would, if anything, produce a slight underestimate in alleles of frequency one and two. Higher-frequency polymorphisms with generally low-quality values (20 or lower) were also verified by checking the chromatograms. A total of 876 high-quality fragment alignments were obtained from 979 PCR primers and used for the analyses in this paper. Of the remaining PCR primers, some failed at the stage of PCR amplification and sequencing, while some produced sequencing output that could not be base-called with certainty when the sequence quality was particularly low or when there was evidence that the primer pairs amplified two or more different products in some of the accessions. To calculate genetic distances, we used a set of markers that have been genetically mapped to the Lister and Dean recombinant inbred lines and that also can be mapped to the AGI reference genome. Some markers were removed so that both physical position and genetic position were monotonically increasing functions. All data are publicly available through our Web site (http://walnut.usc.edu/2010, and also as Dataset S1. Population structure To infer population structure and assign accessions to populations, we used a model-based clustering algorithm implemented in Structure v. 2.0 [12]. Since A. thaliana is largely homozygous, we used a haploid setting. We used the “linkage model” with “correlated allele frequencies” in Structure, where genetic distances (calculated by fitting a third-order polynomial to the Lister and Dean recombinant inbred mapping data) were used to indicate locus proximity. The algorithm was run with a burn-in length of 50,000 MCMC iterations and then 20,000 iterations for estimating the parameters. This was repeated ten times for each K (ranging from one to 17). In these analyses, each fragment-haplotype was treated as a marker at a multiallelic locus, so that two accessions had a different type if they differed at any site in the fragment. The likelihood of the data increases with K from K = 1 until K = 7 (using the Wilcoxon two-sample test to compare the ten runs for each K; two-sided p = 0.001 for K = 7 versus K = 6). The likelihoods of K = 7 and K = 8 were similar (two-sided p = 0.97). For K > 7, the likelihoods of different runs were more variable than for K ≤ 7, with the added variability caused only by runs with lower likelihoods. Moreover, the additional clusters for K > 8 do not have a majority of the genome for any of the accessions. These observations taken together indicate that it is less meaningful to choose K > 8. In displaying the output from Structure, we computed an average of the ten runs for each K. Because there are K! distinct permutations of the clusters that all correspond to equivalent assignments of membership coefficients to accessions, and because independent runs may produce different permutations, to compute an average we first permuted the clusters to align the solutions. For R runs, there are (K!) R − 1 ways of aligning clusters across runs. To determine which of the clusters of each of the other runs corresponds to a specific cluster in a given run, the symmetric similarity coefficient (SSC) was used with the matrices of membership coefficients (based on the genome-wide average). For a given K, the SSC was calculated for all combinations of pairs of runs: where Q i and Q j are the membership matrices of runs i and j (i ≠ j), P is a permutation; the minimum is taken over all permutations, S is a probability matrix of K columns where all elements equal 1/K, and A F is the Frobenius matrix norm [52]. This is a slight adaptation of the asymmetric similarity coefficient used in previous work [17]. For K = 2, the runs were permuted to the arrangement that maximizes the sum of SSC across pairs of runs, and an average of the membership matrices across runs was then taken. For K > 2, it was not feasible to test all possible arrangements; therefore, the following greedy algorithm was used. (1) Fix a permutation, P 1, of one (randomly chosen) run, . (2) Randomly choose a second run, Q2, and fix the permutation, P 2, that maximizes . (3) Continue sequentially with each remaining run, Q x , where x = 3,…, R, and fix the permutation, Px , that maximizes for the current run, Q x . Because the choice of starting run can affect the result, we tested all ten possibilities for the starting run. For K = 2 to K = 8, there were thus 70 possible ways of starting the algorithm, and in only two of 70 possible cases was a different result obtained. These two solutions differed from the common solution by switching one pair of clusters in one run (2.5% of the clusters differed from the common solution), and switching one pair of clusters in two different runs (5%). We tested for monophyly as follows. For every variable site in a fragment, each cluster was checked for the presence of both alleles as well as for the presence of both alleles outside the cluster. If a variable site in a fragment had both alleles within the cluster as well as outside the cluster, then the whole fragment was deemed nonmonophyletic for that specific cluster. Clusters that failed to show nonmonophyly for a fragment were considered monophyletic for that fragment. Fragments with less than five variable sites and clusters with less than five accessions were always considered to be nonmonophyletic. FST for the inferred clusters was computed as: where θ̂P, total is the average number of pairwise differences per site for all pairs of accessions, and θ̂P, within is the average number of pairwise differences per site for all pairs within cluster i. Of the total of 95 accessions, 40 were hierarchically sampled in pairs from four populations in each of five regions (Table S1). The total amount of variation among these 40 accessions, θ̂P, among40 , was computed by taking the total average pairwise difference for all pairs of the 40 accessions, whereas the amount of variation within populations, θ̂P, withinpop , was calculated by taking the mean of the total average pairwise difference for the pairs of accessions in the 20 populations. The level of variation among geographical regions, θ̂P, amongreg , was computed as the difference between θ̂P, among40 and the mean of the total average pairwise differences for all pairs of accessions within regions. The level of variation among populations, θ̂P, amongpop , was calculated from the following expression: Genomic patterns of polymorphism Correlations were identified between levels of polymorphism and local gene density or degree of duplication (Figure 7). The local gene density was measured as open reading frames per centimorgan in windows of size greater than or equal to 1 Mb (using genetically mapped markers from the Lister and Dean recombinant inbred data as endpoints). The number of open reading frames (excluding pseudogenes and RNA genes) from the annotated reference sequence that fell between these window endpoints was counted, and length in centimorgans of each window was estimated from the genetic distance of the markers used as window endpoints. Correlations were quantified using Spearman's rank correlation, and the significance of the observed values was evaluated using 50,000 permutations that maintained the chromosomal order of all observations but that shuffled the relative positions of the two variables. (For each variable, the lists representing the consecutive values within each chromosome were concatenated in random order and direction to form a circle. The two circles were then randomly aligned with each other.) This is necessary to avoid inflated significance values due to autocorrelations along the chromosomes (of both variables). Using this procedure, the rank correlation between θ̂S in nonexon sequences and gene density is −0.27 (p = 0.0014), and the rank correlation between θ̂S in nonexon sequences and the negative log of the second-best BLAST e-value is 0.13 (p = 0.0018). To investigate the effect of population structure, all analyses (except those of population structure) were repeated with the outliers in Figure 4 removed (Cvi-0, Mr-0, and all but one randomly chosen member of each closely related group). All conclusions remain qualitatively the same. Supporting Information Dataset S1 All Data Used in the Paper (912 KB ZIP). Click here for additional data file. Figure S1 Hierarchical Clustering of Individuals Based on Pairwise Differences (12 MB EPS). Click here for additional data file. Table S1 The Population Samples Used in the Project (8 KB PDF). Click here for additional data file. Table S2 The Individual Accessions Used in the Project (8 KB PDF). Click here for additional data file.
                Bookmark

                Author and article information

                Journal
                0410462
                6011
                Nature
                Nature
                Nature
                0028-0836
                1476-4687
                27 July 2016
                12 May 2013
                6 June 2013
                03 August 2016
                : 498
                : 7452
                : 94-98
                Affiliations
                [1 ]Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO), Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV), 36821 Irapuato, Guanajuato, México
                [2 ]The School of Plant Sciences and iPlant Collaborative, University of Arizona, Tucson, Arizona 85721, USA
                [3 ]Departamento de Alimentos, División de Ciencias de la Vida, Universidad de Guanajuato, 36500 Irapuato, Guanajuato, México
                [4 ]Department of Biological Sciences, University at Buffalo, Buffalo, New York 14260, USA
                [5 ]Department of Biology, Chongqing University of Science and Technology, 4000042 Chongqing, China
                [6 ]Departamento de Genética, Unidad Irapuato, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional (CINVESTAV), 36821 Irapuato, Guanajuato, México
                [7 ]Instituto de Biotecnología y Ecología Aplicada, Universidad Veracruzana, 91090 Xalapa, Veracruz, México
                [8 ]Department of Plant Biology, Michigan State University, East Lansing, Michigan 48824, USA
                [9 ]Centro Universitario de la Ciénega, Universidad de Guadalajara, 47840 Ocotlán, Jalisco, México
                [10 ]Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, Pennsylvania 16802, USA
                [11 ]Singapore Centre on Environmental Life Sciences Engineering, Nanyang Technological University, 637551 Singapore
                [12 ]Centre for Genomic Regulation (CRG), 08003 Barcelona, Spain
                [13 ]Universitat Pompeu Fabra (UPF), 08018 Barcelona, Spain
                [14 ]Max Planck Institute for Molecular Genetics, 14195 Berlin, Germany
                [15 ]Department of Biology, Indiana University, Bloomington, Indiana 47405, USA
                [16 ]Waksman Institute of Microbiology and Department of Plant Biology and Pathology, Rutgers University, New Brunswick, New Jersey 08854, USA
                [17 ]The Donald Danforth Plant Science Center, St. Louis, Missouri 63132, USA
                Author notes
                Correspondence and requests for materials should be addressed to V.A.A. ( vaalbert@ 123456buffalo.edu ) or L.H.-E. ( lherrera@ 123456langebio.cinvestav.mx )
                Article
                NIHMS805566
                10.1038/nature12132
                4972453
                23665961
                856c7dc2-f350-4fb1-814d-1f7040f42951

                Reprints and permissions information is available at www.nature.com/reprints. This paper is distributed under the terms of the Creative Commons Attribution-Non-Commercial-Share Alike licence, and is freely available to all readers at www.nature.com/nature.

                This work is licensed under a Creative Commons Attribution-NonCommercial-Share Alike 3.0 Unported licence. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-sa/3.0

                History
                Categories
                Article

                Uncategorized
                Uncategorized

                Comments

                Comment on this article