Understanding the genetic basis of physiological, developmental and morphological variation in domesticated Asian rice (Oryza sativa) is critical for improving the quality, safety, reliability and sustainability of the world's food supply. Human population growth, particularly in developing countries where rice is the main source of caloric intake1, coupled with climate change and the intensive water, land and labour requirements of rice cultivation2, creates a pressing and continuous global need for new, stress tolerant, resource-use efficient, and highly productive rice varieties. To assist in this endeavour, the scientific community has created a wealth of genomic and plant breeding resources, including high-quality genome sequences3 4, dense SNP maps5 6 7, extensive germplasm collections6 8 9 and public databases of genomic information5 6 10 11. Despite the availability of these scientific resources, most of what we know about the genetic architecture of complex traits in rice is based on traditional quantitative trait locus (QTL) linkage mapping using bi-parental populations. While providing valuable insights12, the QTL approach is clearly not 'scalable' to investigate the genomic potential and tremendous phenotypic variation of the more than 120,000 accessions available in public germplasm repositories. Genome-wide association study (GWAS) mapping makes it possible to simultaneously screen a very large number of accessions for genetic variation underlying diverse complex traits. An extra advantage of the GWAS design for rice is the homozygous nature of most rice varieties, which makes it possible to employ a 'genotype or sequence once and phenotype many times over' strategy, whereby once the lines are genomically characterized, the genetic data can be reused many times over across different phenotypes and environments. Here we present a genome-wide association study in a global collection of 413 diverse rice (O. sativa) varieties from 82 countries using a high-quality custom-designed 44,100 oligonucleotide genotyping array. For these varieties, we systematically phenotyped 34 morphological, developmental and agronomic traits over two consecutive field seasons. Our mapping strategy evaluated variation both within and among four of the major subgroups of rice, revealing significant heterogeneity of genetic architecture among groups, as well as gene-by-environment effects. Unlike previous GWAS studies in rice5, purified seed stocks of the rice strains and all the genotypic and phenotypic information generated over the course of this study are publicly available, creating a valuable, open-source translational research platform that can be rapidly expanded through community participation to enhance the power and resolution of GWAS in rice. Results Diversity panel and genotyping array A rice diversity panel consisting of 413 inbred accessions of O. sativa collected from 82 countries (Fig. 1; Supplementary Data 1) was genotyped using an Affymetrix single nucleotide polymorphism (SNP) array containing 44,100 SNPs (hereafter referred to as the 44 K chip). With a genome size of ∼380 Mb (ref. 13), this custom-designed genotyping chip provides high quality data (less than 4.5% missing data), with ∼1 SNP per 10 kb across the 12 chromosomes of rice (Methods; Supplementary Data 2). The diversity panel was evaluated for 34 traits related to plant morphology, grain quality, plant development and agronomic performance using field-grown plants with replications within and between years (Supplementary Table S1; Supplementary Data 3). Population structure and linkage disequilibrium estimation in rice Using principle component analysis (PCA)14 to summarize global genetic variation in the diversity panel, we observed clear, deep subpopulation structure in this collection of germplasm (Fig. 1a). The top four principal components (PCs) explained almost half of the genetic variation (Fig. 1b). The five subpopulations indica, aus, temperate japonica, tropical japonica and aromatic formed clear clusters based on the top four PCs, and were well differentiated from each other, with pairwise Fst (F-statistic) values ranging from 0.23–0.53. This is in agreement with previous findings where global germplasm collections have been used in combination with much smaller numbers of SNP or simple sequence repeat (SSR) genotypes8 15 16 17. Because the array was designed to assay variation in all O. sativa groups, most SNPs are shared or polymorphic across subpopulations (Table 1). We examined allele sharing across the panel by calculating 'identity by state' coefficients among all pairs of accessions (Fig. 2a). We find that whereas allele sharing clearly tracks subpopulation ancestry as identified by the PCA analysis, there is also a substantial number of admixed accessions, highlighting the complex history of rice varieties grown throughout the world16. Excluding the small sample of aromatic accessions, the mean observed identical by state (IBS) sharing is greatest between the closely related tropical japonica and temperate japonica accessions (0.80), followed by indica and aus (0.64), with relatively little IBS sharing between the two major subspecies, Indica and Japonica (0.47) (Fig. 2a). The fact that most of the admixture occurs within (rather than between) subspecies underscores the existence of genetic and cultural barriers to genetic exchange between these two major groups of Asian rice, despite documented cases of targeted Japonica-Indica introgression mediated by artificial selection18 19. The amount of genomic variation tagged by our SNP array was calculated by measuring the pairwise SNP linkage disequilibrium (LD) among the 44 K common SNPs. On average, LD drops to almost background levels around 500 kb–1 Mb, reaching half of its initial value at ∼100 kb in indica, 200 kb in aus and temperate japonica, and 300 kb in tropical japonica (Supplementary Fig. S1). Given that our average inter-marker distance is 10 kb, we expect to have reasonable power to identify common variants of large effect associated with our traits of interest, even if we have not queried the causal variant for association in the domesticated varieties. Phenotypic variation The phenotypes we examined in our GWAS can be classified broadly into six categories: plant morphology-related traits; yield-related traits; seed and grain morphology-related traits; stress-related phenotypes; cooking, eating and nutritional-quality-related traits; and plant development, represented by flowering time, which we measured in three geographic locations that differed in day-length and ambient temperature. Canonical correlation analysis demonstrated that phenotypes within a category are often correlated, ranging from a low of −0.41 between brown rice seed width and brown rice seed length, to a high of 0.9 between hulled and dehulled seed morphology (Fig. 2b; Supplementary Fig. S2). For all the phenotypes evaluated in this study, we observed global similarities among members of the same subpopulation, consistent with the domestication and breeding history of these varieties. Correlation coefficients between accession pairs across all phenotypes were significantly higher for accession pairs from the same subpopulation than from different subpopulations (P 20 Kb between the tagging SNPs. This generated a well-distributed SNP array providing ∼1 SNP every 10 kb along the 12 chromosomes of rice. The microarray data has been deposited in the NCBI dbSNP Database under the accession codes 469281739 to 469324700. Target probe preparation and 44 K SNP array hybridization Rice genomic DNA was extracted from young green leaf tissue following Qiagen plant DNeasy protocol. The probe was generated using the BioPrime DNA labeling kit (Invitrogen, Cat. No: 18094-011), and hybridization conditions were based on the Affymetrix SNP 6.0 protocol. Approximately 750 μg to 1 μg of rice genomic DNA was labelled overnight at 25 °C using 3 vol of the BioPrime DNA labelling reactions. The labelled DNA was ethanol precipitated, resuspended in 40 μl H2O, and then added to the Affymetrix SNP 6.0 hybridization cocktail. We did not include Human Cot-1 DNA because of the small size of the rice genome and the fact that it has a much smaller proportion of repetitive DNA compared with human or other mammals for which the assay was originally optimized. SNP genotype calling Genotypes are called using our program ALCHEMY, which was designed to provide improved performance in small sample sizes and for inbred populations with very low levels of heterozygosity60. SNPs with low quality (that is, low call rate and allele frequency) across all samples were removed from the dataset and 36,901 high-performing SNPs (call rate >70%, minor allele frequency >0.01) were used for all analyses. Of these SNPs, inbred samples had a median call rate of 95.9% and pairwise concordances between technical replicates yielded >99% average pairwise concordance and >92% average call rate. Plant materials The Rice Diversity Panel consists of 413 Asian rice (O. sativa) cultivars, including many landraces, which originated from 82 countries, representing all the major rice-growing regions of the world15. The panel contains 87 indica, 57 aus, 96 temperate japonica, 97 tropical japonica, 14 groupV/aromatic, and 62 highly admixed accessions. All accessions were purified for two generations (single seed descent) before DNA extraction. In all, 20 of these 413 accessions were purified as part of the OryzaSNP project6. Six cultivars (Azucena, Moroberekan, Nipponbare, Dom-Sofid, IR64, M-202) were purified separately, once by Ali et al.15 and once as part of the OryzaSNP panel. Further information for each accession (accession name, accession number, country of origin and subpopulation ancestry based on PCA) is given in Supplementary Data 1. Phenotypic evaluation and correlation among individuals Rice accessions were evaluated in the field at Stuttgart, Arkansas during the growing season (May–October) in 2006 and 2007. Two replications per year were grown in a randomized complete block design in single-row plots of 5 m length with a spacing of 25 cm between the plants and 0.50 m between the rows. A brief description of each trait, its acronym, and evaluation methodology are summarized in Supplementary Table S1. Phenotypic correlations between individuals were calculated based on all phenotypes used in our study. Estimation of LD decay in rice The amount of genomic variation tagged by our SNP array was calculated by measuring the pairwise SNP linkage disequilibrium (LD) among the 44 K common SNPs (with MAF>0.05) using r 2, the correlation in frequency among pairs of alleles across a pair of markers. For all pairs of autosomal SNPs, r 2 was calculated using the --r2 --ld-window 99999 --ld-window-r2 0 command in PLINK61. Of the more than 44,100 SNP variants we assayed, we found 34,454 (∼78%) with minor allele frequency >0.05 across the O. sativa panel. When calculated across the entire O. sativa panel, LD is small at short distances (r 2<0.45 at 5 kb) but then decays more slowly, and still shows substantial residual LD at a distance of 2 Mb, reflecting the deep subpopulation structure (Supplementary Fig. S2). Within each subpopulation, we calculated r 2 between all pairs of SNPs where both SNPs had <20% missing data and MAF≥5%. Population structure Principal component analysis was done using EIGENSOFT14. PC1 separates the samples into two main subspecies- Indica and Japonica and explains 34% of the genetic variance whereas PC2 separates indica from aus and explains 10% of the variance. We find that PC3 separates the two japonica groups into temperate and tropical components (∼6% of the variance), and PC4 identifies the aromatic group as a clear and distinct gene pool (∼2% of the variance). (1- IBS) values were used as the distance between individuals to construct the hierarchical clustering tree using complete linkage method in Figure 2a. Genome-wide association Association analyses were performed with and without correcting for population structure. A mixed model approach implemented in EMMA22 was used to correct the confounding of population structure. The relatedness matrix, measured as the genetic similarity between individuals and IBS values (that is, proportion of times a given pair of accessions had the same genotype across all SNPs), was used to estimate random effects. For all samples, SNPs and the top four PCs were used as fixed effects; for association analysis within each subpopulation, only SNPs were used as fixed effects in the model. For analyses without confounding, simple linear regression or logistic regression was used for continuous and binary traits, respectively. All statistical model details are described in the Supplementary Method. Unless explicitly mentioned, when two-year data were available, mean values across replicates and years of phenotypes were used in association analysis throughout the paper. To examine the effect of 'year' on GWAS results, we introduced 'year' as a covariate in the mixed model, along with the SNPs and 4 PCs. We graphed the correlation between P-values using the two-year-phenotypic mean and using 'year' as a cofactor in the model for flowering time and flag leaf length (Supplementary Fig. S41). When examining GxE effects across locations, only 2007 flowering time data from Arkansas was used for consistency with single-year data from the other locations. Candidate genes near hits were extracted from the literature. Rice homologues of Arabidopsis flowering time genes were extracted from the Gramene Database (www.gramene.org). Phenotypic variance contribution of significant loci To obtain significant loci from EMMA for each phenotype, all significant SNPs within 200 Kb were consolidated into one lowest P-value SNP to remove linkage disequilibrium. Large LD regions such as Hd1 were also consolidated into one single, most significant SNP. Only continuous traits were considered for variance contribution estimation. SNP contribution to the phenotypic variance was estimated using ANOVA with the R package; statistical model details are provided in the Supplementary Method. Author contributions K.Z. and C.-W.T. contributed equally to the work, and C.D.B. and S.R.M. co-supervised the project. K.Z., C.-W.T., S.R.M., C.D.B., G.C.E. and A.M.M. conceived and designed the experiments. . K.Z., C.-W.T., M.H.W., G.C.E., M.L.A., G.J.N., R.I., A.M.M. and A.H.P. performed the experiments. K.Z., C.-W.T., M.H.W., A.R. and S.R.M.,analysed the data. K.Z., C.-W.T., M.H.W., G.C.E., A.M.M., J.M. and S.R.M. contributed reagents/materials/analysis tools. K.Z., C.-W.T., J.M., C.D.B., S.R.M. wrote the paper. Additional information Accession codes: The microarray data has been deposited in the NCBI dbSNP Database under the accession codes 469281739 to 469324700. How to cite this article: Zhao, K. et al. Genome-wide association mapping reveals a rich genetic architecture of complex traits in Oryza sativa. Nat. Commun. 2:467 doi: 10.1038/ncomms1467 (2011). Supplementary Material Supplementary Information Supplementary Figures S1-S41, Supplementary Table S1 and Supplementary Method. Supplementary Data 1 List of O. sativa accessions and country of origin and principle component assignment. Supplementary Data 2 SNP information for the rice 44K SNP array. Supplementary Data 3 Phenotypic data for the 34 traits used for the genome wide association study. Supplementary Data 4 Summary of the percent phenotypic variance explained by significant loci from EMMA for 28 quantitatively inherited traits.