An Exhaustive Epistatic SNP Association Analysis on Expanded Wellcome Trust Data

Abstract

We present an approach for genome-wide association analysis with improved power on the Wellcome Trust data consisting of seven common phenotypes and shared controls. We achieved improved power by expanding the control set to include other disease cohorts, multiple races, and closely related individuals. Within this setting, we conducted exhaustive univariate and epistatic interaction association analyses. Use of the expanded control set identified more known associations with Crohn's disease and potential new biology, including several plausible epistatic interactions in several diseases. Our work suggests that carefully combining data from large repositories could reveal many new biological insights through increased power. As a community resource, all results have been made available through an interactive web server.

Most cited references 10

Improved linear mixed models for genome-wide association studies.

Christoph Lippert, C Kadie, I. Davidson … (2012)

Cited 138 times

Two-Stage Two-Locus Models in Genome-Wide Association

David M Evans, Jonathan Marchini, Andrew Morris … (2006)

Introduction

There is growing evidence supporting an important role for epistasis in the etiology of complex traits. Studies employing model organisms such as Drosophila melanogaster and Saccharomyces cerevisiae (yeast) have suggested that epistasis occurs frequently, involves multiple loci, and in some cases produces effects as large as the main effects at the individual loci [1–4]. Although there is growing appreciation that searching for epistatic interactions in humans may be a fruitful endeavor, there is no consensus as to the best strategy for their detection, particularly in the case of genome-wide association (GWA) where the number of potential comparisons is enormous [5–7]. Recently, Marchini et al. [8] demonstrated that despite the considerable penalty introduced by multiple testing, it was possible to identify interacting loci that increased odds of disease using realistic sample sizes. Interestingly, searching over all possible pairs of loci was often more powerful than performing a single-locus scan across the genome in cases where the loci interacted epistatically. However, Marchini et al. examined three underlying disease models which represent only a small fraction of the entire parameter space of possible models. Additionally, Marchini et al. explicitly avoided models with little or no main effects at the margins. In these situations, single-locus searches are likely to fail and pairwise or higher-order searches will be necessary in order to detect loci [9–11]. A number of recent studies in humans and animals have identified loci that interact significantly but contribute little or no effect at the margins [12–29]. Such scenarios are challenging and worthy of further investigation since much of the current gene mapping methodology depends on the assumption of non-negligible main effects.
If models that exhibit negligible marginal effects are common, then this will have significant consequences for how we go about searching for the genetic basis of complex phenotypes. Even if testing all possible pairwise comparisons can be justified on theoretical grounds, there are still a number of practical difficulties associated with performing such a large number of statistical tests (e.g., storage requirements and computation time). For example, a smallish scan consisting of only 100,000 markers would entail $\binom{100000}{2}$ (i.e., 100,000 “choose” 2), or approximately $5 \times 10^{9}$ comparisons, which might take several days on a typical workstation. Thus, rather than testing all possible pairwise comparisons, a more practical strategy might be to examine a subset of loci which could influence the trait. An obvious method of selecting loci is to evaluate their performance first on a single-locus test of association. That is, loci that meet some low threshold in a single-locus test are subsequently followed up in a two-locus analysis [8]. Intuitively, such an approach might also provide an advantage in power since the penalty due to multiple testing would not be as great as in an exhaustive pairwise search. We therefore investigated the performance of two simple two-stage strategies. In the first strategy, only loci that met an initial threshold were included in subsequent testing. In the second strategy, any locus that met the first-stage threshold was subsequently tested with all other markers across the genome regardless of whether these other markers met the initial threshold.
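The size of the exhaustive search and the two candidate-selection strategies described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the toy p-values, and the 0.05 first-stage threshold are all assumptions made for the example.

```python
import math
from itertools import combinations


def exhaustive_pair_count(n_markers):
    """Number of distinct marker pairs in an exhaustive two-locus scan."""
    return math.comb(n_markers, 2)


def both_significant_pairs(p_values, threshold):
    """First strategy: keep only pairs in which BOTH loci pass stage one."""
    passed = [i for i, p in enumerate(p_values) if p <= threshold]
    return list(combinations(passed, 2))


def either_significant_pairs(p_values, threshold):
    """Second strategy: keep pairs in which AT LEAST ONE locus passes stage one."""
    passed = {i for i, p in enumerate(p_values) if p <= threshold}
    all_pairs = combinations(range(len(p_values)), 2)
    return [(i, j) for i, j in all_pairs if i in passed or j in passed]


# 100,000 markers give just under 5e9 pairwise comparisons.
print(exhaustive_pair_count(100_000))  # 4999950000

# Hypothetical single-locus p-values for five markers.
toy_p = [0.001, 0.20, 0.03, 0.60, 0.04]
print(both_significant_pairs(toy_p, 0.05))         # [(0, 2), (0, 4), (2, 4)]
print(len(either_significant_pairs(toy_p, 0.05)))  # 9
```

In the toy data, markers 0, 2, and 4 pass the first stage, so the first strategy tests 3 pairs while the second tests every pair except (1, 3).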
In summary, our manuscript expands on earlier work by (a) examining the performance of a two-locus and single-locus search across the genome for an extensive range of genetic models, (b) investigating the performance of each of these strategies using quantitative rather than disease traits, and (c) characterizing the performance of two two-stage strategies which reduce the computational and multiple testing burden associated with an exhaustive two-locus search across the genome. Based on our extensive simulations, we demonstrate that an exhaustive two-locus search is more powerful than a single-locus strategy when loci interact for many of the situations considered, and is capable of detecting interacting loci that contribute to moderate proportions of the phenotypic variance using realistic sample sizes. In addition, we also show that an exhaustive search involving all pairwise combinations of markers across the genome is preferable to analyzing the data using a two-stage procedure that first conditions on one or both of the loci meeting some marginal level of significance in a single-locus test. Our results suggest that an exhaustive pairwise search of markers across the genome may provide a useful complement to single-locus scans in identifying interacting loci that contribute to moderate proportions of the phenotypic variance.

Results

Figure 1 presents an illustrative selection of results in order to highlight some general features of the data (a full list of results for all models and conditions can be found in Dataset S1). The first feature that should be apparent is that only 1,500 individuals are required in order to detect loci responsible for moderate proportions of the phenotypic variance (i.e., 5%) with appreciable power (approximately 80%) using an exhaustive pairwise search across the genome.
As expected, when there was no interaction between the loci, the power to detect either locus using a single-locus search was greater than the power to detect both loci using the two-locus strategy. However, the power to detect both loci using a single-locus strategy was actually less than the power of the two-locus search—even when there was no epistasis, a result which also held with different numbers of markers across the genome (results not shown). Given this result, it is perhaps not surprising that when epistasis was present, the power to detect both loci was always greatest using the two-locus strategy. This reflects the usual situation in statistics where the most powerful test is the one that encompasses the true underlying model. However, much more interesting is the comparison between the power of the two-locus search and the power to detect either locus via single-locus scan (Figure 1). For many of the models, the power to detect either locus using a single-locus strategy was greater than the power of a two-locus search for the majority of allele frequencies considered. However, for a small to moderate proportion of the space of possible allele frequencies, a two-locus strategy actually performed better than a single-locus search. These situations represent cases where the combination of allele frequencies is such that the majority of the genetic variance resides in the epistatic variance component, and hence the loci cannot be identified via single-locus tests of association (see below). Interestingly, Figure 1 also shows that for a small number of models, the power to detect either locus using a single-locus strategy is actually less than the power of a two-locus search for the majority of the parameter space of allele frequency combinations. These models tended to be the more exotic looking ones (e.g., M170, which requires an individual to be heterozygous at one locus and homozygous at the other in order to display the increased phenotype). 
Models such as these are difficult to explain via simple additive and dominance effects and thus are not amenable to single-locus tests of association. Figure 2 illustrates a partitioning of the variance for a simple quantitative trait model (M27). Under this model, an individual requires at least one copy of the increaser allele at both loci in order to increase the quantitative phenotype above baseline levels (one could imagine how this could occur via simple biological process). Notice in particular how the proportion of the genetic variance in each of the different components varies with changes in the allele frequencies. For example, when the “A” and “B” alleles are both common, the majority of variance resides in the epistatic component, and there is little effect at the margins. In this situation, single-locus tests of association might fail to detect the loci even though they clearly influence the quantitative trait. It is only when a two-locus model which explicitly models the interaction between both contributing loci is fit to the data that the true underlying relationship becomes apparent and both loci can be identified [8]. In fact, it is sobering to realize that many simple-looking models (e.g., M1, M3, etc.) also contain sizeable regions of the space of possible allele frequencies where the epistatic variance component is appreciable (see the bottom of Figure 2 for some more examples). The implication is that these situations might also be common in reality and therefore frustrate attempts to localize genes in human populations via simple single-locus approaches. Figure 3 illustrates the performance of each of the two-locus strategies for the same model depicted in Figure 2, which is also representative of many of the situations considered. 
In the case of the Both Significant Two-Stage Strategy (i.e., where both loci were required to meet a first-stage threshold in the single-locus scans in order to test the full interaction model), as the threshold became increasingly stringent, there was a decrease in power across the majority of the space of possible allele frequencies. However, for a small proportion of this space (i.e., when the allele frequencies at both loci were similar in the case of this model), there was a small increase in power, so long as the first-stage threshold was not too strict (the exact level varied across models). This pattern of results was similar across all of the models considered. In contrast, the power of the Either Significant Two-Stage Strategy (i.e., where only one locus was required to meet a first-stage threshold in order to test the full interaction model) was dependent on the type of model under which the data were simulated. For example, in the case of more exotic models (e.g., M170), there was a decrease in power across the majority of the space of possible allele frequencies even at liberal first-stage cutoff levels (i.e., p […] $k_l + k_m$ (where $k_m = 0$ in the case of the Either Significant Two-Stage Strategy). We therefore defined a new statistic, $R(l,m) - (k_l + k_m)$, and assessed the significance of this statistic against a distribution in which d′ is the degrees of freedom of the full model fitted at the two loci. In the tests, the null hypothesis being tested is that both loci are not associated with the phenotype. We set the level of significance using a Bonferroni correction based on the expected number of tests to be performed, yielding the same overall error rate in each strategy: $\alpha / \binom{\alpha_1 L}{2}$ in the case of the Both Significant Two-Stage Strategy and $\alpha / \left[\binom{L}{2} - \binom{L - \alpha_1 L}{2}\right]$ in the case of the Either Significant Two-Stage Strategy. Through simulation we found this procedure to provide an accurate test of interaction between two loci [8].
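The Bonferroni thresholds for the two strategies can be made concrete with a short calculation. The sketch below assumes that L is the number of loci, that α₁ is the first-stage threshold (so that roughly α₁L loci pass stage one under the null), and that α₁L is rounded to an integer; these readings, the function names, and the example values are assumptions for illustration.

```python
import math


def both_significant_alpha(alpha, alpha1, n_loci):
    """Per-test level when both loci must pass stage one: alpha / C(alpha1 * L, 2)."""
    n_pass = round(alpha1 * n_loci)  # expected number of loci passing stage one
    return alpha / math.comb(n_pass, 2)


def either_significant_alpha(alpha, alpha1, n_loci):
    """Per-test level when one locus must pass: alpha / [C(L, 2) - C(L - alpha1 * L, 2)]."""
    n_pass = round(alpha1 * n_loci)
    # All pairs minus the pairs in which neither locus passed stage one.
    n_tests = math.comb(n_loci, 2) - math.comb(n_loci - n_pass, 2)
    return alpha / n_tests


def exhaustive_alpha(alpha, n_loci):
    """Per-test level for an exhaustive scan of all C(L, 2) pairs, for comparison."""
    return alpha / math.comb(n_loci, 2)


# Hypothetical settings: 1,000 loci, first-stage threshold 0.1, overall level 0.05.
for name, level in [
    ("both", both_significant_alpha(0.05, 0.1, 1000)),
    ("either", either_significant_alpha(0.05, 0.1, 1000)),
    ("exhaustive", exhaustive_alpha(0.05, 1000)),
]:
    print(f"{name:>10}: {level:.3e}")
```

With these settings the strategies perform 4,950, 94,950, and 499,500 tests respectively, so the per-test threshold is least strict for the Both Significant strategy and most strict for the exhaustive search.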
Partitioning the model into expected variance components

In order to better understand the performance of the different strategies, it is instructive to consider how the total genetic variance for any two-locus model can be divided into three mutually exclusive components: the genetic variance at locus 1, the genetic variance at locus 2 and the epistatic (or interaction) variance (the derivation of these components is well known [42–46] and we refer the interested reader to these articles as well as several excellent texts which cover the subject [47,48]). The partitioning is important because it represents the amount of genetic variance (and hence statistical power) which can be captured by a single-locus test of association at locus 1, a single-locus test of association at locus 2, and the amount of genetic variance which cannot be captured by single-locus tests, respectively. The single-locus components represent the effects of a locus averaged over all other loci (i.e., the ‘marginal’ effects), and are equivalent to the combined additive and dominance components of classical quantitative genetics [49]. The epistatic component arises because of nonadditive interactions between the loci and represents the remainder of the genetic variance which is not accounted for by the single-locus components. The sum of all three components is the amount of genetic variance which can be captured by a two-locus test of association. To illustrate this partitioning formally, consider the 3 × 3 matrix of genotypic means and their frequencies in Table 1. Each row displays the genotype at the first locus, whereas each column indexes the genotype at the second. Each cell contains a genotypic mean and its respective frequency (under Hardy-Weinberg equilibrium). Given the genotypic means and frequencies at both loci, it is possible to calculate the mean (μ) and total genetic variance of the system (σ²).
The mean is the sum of all genotypic means weighted by their frequencies. The total genetic variance is equal to each genotypic mean minus the overall mean (μ), squared, and then summed over all possible genotypes weighted by the appropriate genotypic frequency. The amount of genetic variance at locus 1 (or locus 2) may be calculated by subtracting the mean for each marginal genotype from the overall mean, squaring and then summing over all possible genotypes weighted by the relevant genotypic frequency. The epistatic variance is simply the amount of genetic variance not accounted for by the single-locus components. We performed this partitioning for all 51 models in Figure 4 in order to gain a better insight into the relative performance of each of the different GWA strategies.

Supporting Information

Dataset S1. Supplementary Results (4.9 MB XLS)
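The variance partitioning described in words above (overall mean, total genetic variance, two marginal single-locus components, and the epistatic remainder) can be sketched numerically. The code below is an illustration under stated assumptions rather than the authors' implementation: genotype frequencies follow Hardy-Weinberg proportions, the two loci are treated as independent so that joint frequencies are products, and the 3 × 3 matrix `threshold_model` is one hypothetical encoding of a model in which at least one increaser allele is needed at both loci. The function names and the allele frequencies in the loop are likewise illustrative.

```python
def hwe_freqs(p):
    """Frequencies of genotypes carrying 0, 1, 2 increaser alleles under HWE."""
    q = 1.0 - p
    return [q * q, 2.0 * p * q, p * p]


def partition_variance(means, p1, p2):
    """Split the genetic variance of a two-locus model into its components.

    means[i][j] is the genotypic mean for i increaser alleles at locus 1 and
    j at locus 2; p1 and p2 are the increaser allele frequencies.
    """
    f1, f2 = hwe_freqs(p1), hwe_freqs(p2)
    cells = [(i, j) for i in range(3) for j in range(3)]
    # Overall mean: genotypic means weighted by joint genotype frequencies
    # (joint frequency = product, assuming the loci are independent).
    mu = sum(f1[i] * f2[j] * means[i][j] for i, j in cells)
    total = sum(f1[i] * f2[j] * (means[i][j] - mu) ** 2 for i, j in cells)
    # Marginal means: each locus averaged over the genotypes at the other.
    marg1 = [sum(f2[j] * means[i][j] for j in range(3)) for i in range(3)]
    marg2 = [sum(f1[i] * means[i][j] for i in range(3)) for j in range(3)]
    v1 = sum(f1[i] * (marg1[i] - mu) ** 2 for i in range(3))
    v2 = sum(f2[j] * (marg2[j] - mu) ** 2 for j in range(3))
    return {"mean": mu, "total": total, "locus1": v1, "locus2": v2,
            "epistatic": total - v1 - v2}


# Hypothetical threshold model: phenotype is raised only when at least one
# increaser allele is present at both loci.
threshold_model = [[0, 0, 0],
                   [0, 1, 1],
                   [0, 1, 1]]

for freq in (0.05, 0.5):
    parts = partition_variance(threshold_model, freq, freq)
    share = parts["epistatic"] / parts["total"]
    print(f"increaser allele frequency {freq}: epistatic share = {share:.2f}")
```

Running the loop shows how the share of the variance that single-locus tests cannot capture changes with allele frequency, which is the behaviour the partitioning is meant to expose. For a purely additive matrix (for example `means[i][j] = i + j`) the epistatic component is zero up to rounding.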

Cited 75 times

BOOST: A fast approach to detecting gene-gene interactions in genome-wide case-control studies

Xiang Wan, Ming-Can Yang, Shi Qiang Yang … (2010)

Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named “BOolean Operation based Screening and Testing” (BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4 GB of memory running the Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant difference from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.
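The abstract says only that BOOST is based on Boolean operations, so the following is a sketch of the general idea of bit-packed genotype counting and not BOOST's actual code or test statistic. It assumes genotypes are coded 0, 1, or 2 per sample; the function names and toy data are hypothetical. Each SNP is stored as three bitmasks (one per genotype), and a joint genotype count is obtained with a bitwise AND followed by a bit count.

```python
def pack_bitmasks(genotypes):
    """Return three integers; bit s of mask g is set when sample s has genotype g."""
    masks = [0, 0, 0]
    for sample, g in enumerate(genotypes):
        masks[g] |= 1 << sample
    return masks


def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def joint_genotype_table(snp_a, snp_b):
    """3 x 3 table of joint genotype counts using AND plus bit counting."""
    masks_a, masks_b = pack_bitmasks(snp_a), pack_bitmasks(snp_b)
    return [[popcount(ma & mb) for mb in masks_b] for ma in masks_a]


# Toy genotypes for six samples at two SNPs.
snp_a = [0, 1, 2, 1, 0, 2]
snp_b = [0, 0, 2, 1, 1, 2]
print(joint_genotype_table(snp_a, snp_b))  # [[1, 1, 0], [1, 1, 0], [0, 0, 2]]
```

In a case-control setting one would build such a table separately for cases and for controls before computing any interaction statistic. The appeal of the representation is that each cell costs one AND and one bit count over packed words instead of a pass over individual samples.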

Cited 65 times

Author and article information

Journal

Journal ID (nlm-ta): Sci Rep

Journal ID (iso-abbrev): Sci Rep

Title: Scientific Reports

Publisher: Nature Publishing Group

ISSN (Electronic): 2045-2322

Publication date (Electronic): 22 January 2013

Publication date Collection: 2013

Volume: 3

Electronic Location Identifier: 1099

Affiliations

[1] Microsoft Research, Los Angeles, CA, USA

[2] Microsoft Research, Redmond, WA, USA

[3] These authors contributed equally to this work.

Author notes

[a] lippert@microsoft.com

[b] jennl@microsoft.com

[c] heckerma@microsoft.com

Article

Publisher Item ID: srep01099

DOI: 10.1038/srep01099

PMC ID: 3551227

PubMed ID: 23346356

SO-VID: 62466bf9-6133-4a03-a6ea-fa91bb64fcd6

License:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/
