Epigenetics is being increasingly combined with epidemiology to add mechanistic understanding
to associations observed between environmental, genetic and stochastic factors and
human disease phenotypes. Currently, epigenetic epidemiological studies primarily
focus on exploring if and where the epigenome (i.e. the overall epigenetic state of
a cell) is influenced by specific environmental exposures like prenatal nutrition,[1]
sun exposure[2] and smoking.[3] In this issue of the IJE, Nada Borghol et al.[4]
report an association between childhood socio-economic status (SES) and differential
DNA methylation in adulthood. Low SES may integrate diverse and heterogeneous environmental
influences, and knowing which epigenetic changes are associated with low SES may provide
clues about the biological processes underlying its health consequences. The authors
stress that their study is preliminary. This statement is, in fact, to a greater or
lesser extent applicable to the entire first wave of studies currently being published
that likewise aim to discover associations between epigenetic variation measured on
a genome-wide scale and environmental exposures or disease phenotypes. When executing
such epigenome-wide association studies (EWASs),[5]
every epigenetic epidemiologist is struggling with the same biological, technical
and methodological issues. It is important to take these into consideration when designing
a study and interpreting the results. Let us consider seven of those issues, taking
the current study on SES as a starting point.
We do not really know where to look, or what to look for
Most epigenetic epidemiological studies focus on DNA methylation for various practical
and biological reasons, neglecting other layers of the epigenome, such as histone modifications,
that are also likely to be important in influencing disease phenotypes. Our basic
understanding of the methylome (i.e. the whole of DNA methylation marks on the genome)
is in its infancy, and we are still learning about the specific localization of the
features that, when differentially methylated, regulate gene expression and are thus
relevant for epigenetic epidemiologists to study. The current study, like many others,
evaluated promoter regions, in this case defined as 1000 bp upstream to 250 bp downstream
of transcription start sites. Although these features are often enriched for DNA methylation
marks influencing the expression of genes, recent work suggests that other regions
of the methylome outside of promoters, including inter-genic CpG island shores[6]
and intra-genic CpG islands,[7] may ultimately be more important for regulating phenotypic variation.
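To make the promoter definition mentioned above concrete, the following Python sketch computes a window running from 1000 bp upstream to 250 bp downstream of a transcription start site. The function name, the 1-based coordinate convention and the strand handling are our own assumptions for illustration; they are not taken from the study under discussion.

```python
from typing import Tuple


def promoter_window(tss: int, strand: str,
                    upstream: int = 1000, downstream: int = 250) -> Tuple[int, int]:
    """Return (start, end) of a promoter window around a transcription start site.

    Assumes 1-based genomic coordinates (an illustrative convention).
    'Upstream' is relative to the direction of transcription, so the
    window is mirrored for genes on the minus strand.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp so that the window never runs off the start of the chromosome.
    return max(start, 1), end


if __name__ == "__main__":
    print(promoter_window(5000, "+"))  # plus-strand gene
    print(promoter_window(5000, "-"))  # minus-strand gene
```

Any such fixed window is a convenience; as noted above, functionally relevant regions may well lie outside it.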
For any differentially methylated region identified in EWASs it will be important
to demonstrate functionality. Promoter methylation in the current study was integrated
with public gene expression data and, as expected, highly expressed genes were more
commonly flanked by less methylated promoters and vice versa. A limitation is that
this observation is for groups of promoters, whereas information is needed about this
relationship for individual promoters. Mining the reference epigenomes and transcriptomes
that are being generated for different cell types under the umbrella of initiatives
such as the National Institutes of Health (NIH) Epigenomics Roadmap[8]
and the International Human Epigenome Consortium[9]
may contribute to such information. Additional in vitro experiments will be required
to evaluate the transcriptional effects of differential DNA methylation at a specific
locus independent of its genomic context.[10]
We have to rely on imperfect technology
The good news is that recent advances in genomic technology mean that genome-scale
studies of DNA methylation across multiple samples are now feasible. In practice,
however, one has to compromise between coverage and precision in epidemiological studies,
which likely incorporate a large number of samples. A large (and growing) number of
methods exist for assessing DNA methylation both genome wide and at specific CpG sites,[11]
and one problem relates to our inability to compare results across studies that have
used different platforms. On the one hand there are methods such as that used in the
current study in which the methylated portion of the genome is captured using antibodies
against methylated DNA and subsequently quantified using microarrays or next-generation
sequencing. These approaches can provide coverage across most of the genome and may
be optimally suited to discriminate low from high methylation, but have lower reliability
for smaller differences and are biased by factors such as CG density.[12,13]
On the other hand, there are methods based on the bisulphite conversion of DNA combined
with next-generation sequencing that provide higher accuracy and single nucleotide
resolution. Although whole-genome bisulphite sequencing is currently unfeasible to
use across large epidemiological cohorts, the method can be adapted to target a reduced
representation of the genome (approximately 3 million out of approximately 28 million
CG dinucleotides in the human genome).[12,13]
The recently launched Illumina 450 k Methylation Beadchip may offer a balance between
coverage and precision, which will be attractive for epidemiological EWASs executed
during the next few years.[5]
It interrogates DNA methylation at over 480 000 CG dinucleotides, is high-throughput
and relatively affordable. The precision of this platform appears to compare well
with some of the other platforms,[12,13]
but these results should be interpreted with caution. Although correlation coefficients
reported across the various platform comparisons are high, they are mainly driven
by the fact that the large majority of the genome is either unmethylated or fully
methylated, and substantial discrepancies between platforms may exist for intermediate
level methylation.[12,14]
Therefore, the technological validation of findings using an independent method remains
important. This will be feasible for a small number of ‘top hits’, like the three
protocadherin promoters assessed in the current study. However, validating the outcomes
of the complex pathway analyses performed to implicate either entire biological processes
(such as extra- and intra-cellular signalling in the current study) or genomic features
with a specific function in gene regulation [e.g. promoters, enhancers, inter/intragenic
CG island (shores), etc.] is more demanding and has not yet been realized. Validating
the results of such gene-set testing methods will entail the re-assessment of DNA
methylation across large sets of loci.
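The caution raised above about platform comparisons can be illustrated with a small simulation. The sketch below assumes a hypothetical methylome in which 90% of sites are close to unmethylated or fully methylated, and two hypothetical platforms that each add independent Gaussian noise; all proportions and the noise level are invented for illustration and do not describe any real platform.

```python
import random
from math import sqrt
from typing import List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def measure(true_value: float, noise_sd: float, rng: random.Random) -> float:
    """One noisy measurement, clipped to the valid 0-1 methylation range."""
    return min(1.0, max(0.0, true_value + rng.gauss(0.0, noise_sd)))


def simulate(n_sites: int = 5000, noise_sd: float = 0.1,
             seed: int = 1) -> Tuple[float, float]:
    """Return (correlation over all sites, correlation over intermediate sites)."""
    rng = random.Random(seed)
    truth = []
    for _ in range(n_sites):
        u = rng.random()
        if u < 0.45:                      # assumed: 45% near-unmethylated
            truth.append(rng.uniform(0.0, 0.1))
        elif u < 0.90:                    # assumed: 45% near-fully methylated
            truth.append(rng.uniform(0.9, 1.0))
        else:                             # assumed: 10% intermediate
            truth.append(rng.uniform(0.3, 0.7))
    platform_a = [measure(t, noise_sd, rng) for t in truth]
    platform_b = [measure(t, noise_sd, rng) for t in truth]
    # In a simulation the true level is known, so intermediate sites can be
    # selected directly; with real data this would need an independent method.
    mid = [i for i, t in enumerate(truth) if 0.3 <= t <= 0.7]
    r_all = pearson(platform_a, platform_b)
    r_mid = pearson([platform_a[i] for i in mid], [platform_b[i] for i in mid])
    return r_all, r_mid


if __name__ == "__main__":
    r_all, r_mid = simulate()
    print(f"all sites: r = {r_all:.2f}; intermediate sites only: r = {r_mid:.2f}")
```

Under these assumptions the genome-wide correlation is dominated by the contrast between the two extreme groups, whereas agreement among intermediately methylated sites, which may be the ones of interest, is markedly lower.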
We may be limited by the sample sizes available for the study designs that are optimal for epigenetic epidemiology
The current study investigated only 40 individuals. Investigators will be able to
secure budgets for larger studies as empirical data increasingly highlight the value
of epigenetic epidemiology, and high-throughput, economical laboratory approaches
become more widely adopted. Nevertheless, it is unlikely that the simple brute-force
approach that has been used relatively successfully in genome-wide association studies
(GWASs) is valid for EWASs. In genetics, many of the epidemiological principles about
designing studies with respect to selection biases, confounding, batch effects and
appropriateness of controls could largely be replaced by the simple rule ‘bigger-is-better’.
This is not true for epigenetic epidemiology, because the epigenome is not a static
entity like the genome, which necessitates the use of more conventional epidemiological
approaches.[15]
Further complicating matters is the fact that, for the most powerful study designs
in epigenetic epidemiology (including studies of discordant monozygotic twins,[16]
particularly when longitudinally sampled,[17] early exposure studies with long-term
follow-up[1] and studies of specific cell types[18]), the number of eligible individuals
for whom relevant biological materials have been stored in existing epidemiological
cohorts is often limited, and it will be difficult
to scale up analyses to include the thousands of samples that may be required for
establishing robust associations with disease phenotypes. Moving forward, it will
be important to establish cause and effect in epigenetic epidemiology; disease-associated
differentially methylated regions may arise prior to illness and contribute to the
disease phenotype, or could be a secondary effect of the disease process or of the medications
used in treatment.[19]
Furthermore, maximum information will be obtained from epidemiological studies that
are able to integrate epigenomic information with genomic, transcriptomic and proteomic
data obtained from the same samples.
Whatever we do, it may never be enough to fully account for epigenetic differences
between tissues and cells
In many respects, large comprehensively phenotyped and longitudinally sampled epidemiological
studies, like the 1958 British birth cohort used in the current study, are an ideal
resource for epigenetic epidemiology. In nearly all of these studies, however, whole
blood is the only biological material that has been archived. Blood is a heterogeneous
tissue and any DNA methylation difference between groups could be confounded by differences
in the cellular composition of whole blood samples, for example, resulting from the
immune response to sub-clinical infection. The good news is that fewer DNA methylation
differences than perhaps expected exist between leucocyte types, and controlling
for cellular heterogeneity may be possible in biobanks with a simple blood cell count.[20]
Whether the latter is sufficient (and under which circumstances it is not), however,
remains to be established. Epigenomic studies of separate cell types such as those
being undertaken by the NIH Epigenomic Roadmap Initiative and the European Union Blueprint
consortium are currently generating reference epigenomes of haematopoietic cells that
will be of great utility in this regard.[8]
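The confounding by cellular composition described above can be made explicit with a little arithmetic. In the Python sketch below, whole-blood methylation at a single CG dinucleotide is modelled as the composition-weighted average of cell-type-specific levels. The two cell types, their methylation levels and the group compositions are hypothetical values chosen for illustration only.

```python
from typing import Dict


def whole_blood_methylation(fractions: Dict[str, float],
                            cell_methylation: Dict[str, float]) -> float:
    """Composition-weighted average methylation at one site."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("cell fractions must sum to 1")
    return sum(fractions[cell] * cell_methylation[cell] for cell in fractions)


# Hypothetical cell-type-specific methylation, identical in both groups.
CELL_METHYLATION = {"lymphocytes": 0.80, "neutrophils": 0.60}

# Hypothetical blood cell compositions that differ between the groups.
GROUP_A = {"lymphocytes": 0.30, "neutrophils": 0.70}
GROUP_B = {"lymphocytes": 0.40, "neutrophils": 0.60}

if __name__ == "__main__":
    a = whole_blood_methylation(GROUP_A, CELL_METHYLATION)
    b = whole_blood_methylation(GROUP_B, CELL_METHYLATION)
    print(f"group A: {a:.2f}, group B: {b:.2f}, difference: {b - a:.2f}")
```

In this toy case a between-group difference of two percentage points appears in whole blood although methylation within each cell type is identical, which is why some adjustment for cell counts is needed.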
When moving beyond associations with environmental exposures to epigenetic associations
with phenotypes, a key question for epigenetic epidemiology concerns the extent to
which easily accessible peripheral tissues (such as blood) can be used to ask questions
about inter-individual phenotypic variation manifest in inaccessible tissues such
as the brain, visceral fat and other internal organs and tissues. Cross-tissue comparisons
of the methylome within the same individual are currently underway to establish the
relationship between epigenetic patterns in blood and those in other tissues. Although these
analyses are crucial, the results may not be generally applicable; higher inter-tissue
concordance may be present for DNA methylation changes induced early in development
(and potentially propagated soma-wide) than for changes that occur during ageing, which are
more likely to remain tissue specific.[19,21]
Efforts to obtain biopsies (subcutaneous fat, muscle, etc.) and post-mortem material
in subsets of longitudinal biobanks will greatly increase their value for epigenetic
studies, despite the problems associated with cellular heterogeneity that also hold
for such samples.
We may be trying to detect inherently small effect sizes using these sub-optimal methods
and sample cohorts
The main findings in the current study concerned DNA methylation differences at three
protocadherin promoters.[4]
The extent of the difference at these promoters was similar to that commonly observed
in other recent studies, namely ~5%,[5]
and was most apparent for a single, nominally statistically significant CG dinucleotide
in each region. The biological implications of such small alterations in DNA methylation
in terms of gene expression and function are unknown. Although DNA methylation is
recognized as one of the most stable epigenetic marks, it is still relatively dynamic
and this has important implications for epigenetic epidemiology. The randomness of
maintaining and mitotically transmitting DNA methylation patterns may potentially
dilute the putative epigenetic signatures of an adverse exposure early in life (e.g.
to low SES in childhood) observed decades later. Of note, recent studies indicate
that DNA methylation patterns in leucocytes undergo considerable changes during the
first years of life.[22]
Thus, on top of the previously discussed question of whether DNA methylation at a specific
locus actually influences transcriptional activity, researchers should also aim to
establish whether the small DNA methylation differences often observed between groups (whether
expressed as absolute difference, relative difference or relative to the variation
in the population) translate into differences in gene expression in the relevant tissue.
It will be of particular interest to see whether the effects of such modest differences,
while perhaps of little consequence individually, may shift transcription of a biological
process or functional network when they co-occur with other changes to the methylome.[23]
Little is known about the actual scale and extent of between-individual variation
in DNA methylation across the genome. In this regard, public genome-scale resources
need to be created that document inter-individual differences in DNA methylation and
gene expression, in addition to the reference epigenomes that are currently being
generated.
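How demanding small effects are depends on how they compare with the variation in the population, as noted above. The sketch below uses the textbook normal-approximation formula for comparing two group means to estimate the number of individuals needed per group. The standard deviation of 0.05 and the stringent significance level of 1e-7 are assumptions made for illustration, and the calculation ignores complications specific to EWASs such as cellular heterogeneity and correlated sites.

```python
from math import ceil
from statistics import NormalDist


def samples_per_group(delta: float, sd: float,
                      alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n per group for a two-sided, two-sample comparison of means.

    delta: absolute methylation difference to detect (e.g. 0.05 for 5%)
    sd:    between-individual standard deviation (assumed equal in both groups)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


if __name__ == "__main__":
    # A 5% difference with an assumed SD of 5%, at a nominal threshold.
    print(samples_per_group(delta=0.05, sd=0.05))
    # The same difference at an illustrative, much stricter threshold.
    print(samples_per_group(delta=0.05, sd=0.05, alpha=1e-7))
    # Halving the difference roughly quadruples the requirement.
    print(samples_per_group(delta=0.025, sd=0.05))
```

Because the required sample size scales with the square of the ratio of the standard deviation to the difference, documenting inter-individual variation is a prerequisite for any realistic power calculation.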
We lack a framework for the analysis of genome-wide epigenetic data
The results of GWASs are relatively easy to judge. Quality-control steps are well-defined
and reported, individually testing every genetic variant [i.e. single nucleotide polymorphism
(SNP)] is straightforward, and levels of genome-wide statistical significance are
clear. For EWASs, the analytical methodology is very much under construction. For
example, in the current study it was not possible to attain genome-wide levels of
significance, which is acceptable for an exploratory study, but makes it difficult
to fully interpret the reported differences. Because of the vast range of methods
currently being used to assess DNA methylation, meta-analyses across different studies
are difficult. The adoption of a common technology platform, such as the new Illumina
450 k Methylation Beadchip, across multiple studies would provide an excellent opportunity
to converge on widely accepted guidelines for the analysis and integration of EWAS
data. Apart from pre-processing procedures (quality control, normalization, handling
different probe types, accounting for genetic variation, etc.), elements of these
guidelines should deal with the analysis of individual CG dinucleotides vs groups
of (correlated) adjacent CGs, the use of genome annotations in the analysis (histone
states, promoter types, CG content, etc.), and levels of epigenome-wide significance
for various analyses. An important aspect will be the exploration of the previously
mentioned gene-set testing methods in the context of DNA methylation since they will
be vital to obtain meaningful interpretations of genome-wide data in terms of underlying
biological processes or genomic functions [e.g. promoters, enhancers, inter/intragenic
CG island (shores), etc.]. For example, commonly used enrichment methods assume independence
within a gene set and, apart from consistency in biological signal in a gene set,
statistical significance may reflect consistency in other characteristics such as
GC content, coverage or other sequence features.[24]
Alternative implementations of gene-set testing methods include global testing approaches.[25]
Finally, it will be important to adopt an integrative paradigm based on the combination
of genetic and epigenetic epidemiological data.[26]
Of particular relevance in this respect is evidence for the widespread occurrence
of allele-specific DNA methylation (ASM) across the genome. Recent studies have shown
that there are considerable inter-individual differences in ASM, which are frequently
associated with genetic variation but can also be mediated by genomic imprinting (i.e.
the parent-of-origin dependent silencing of expression by epigenetic mechanisms),
environmental influences and apparently stochastic factors in the cell.[27,28]
ASM can mask the effect of risk alleles by silencing their expression, and also provides
a potential mechanism underlying gene–environment interactions.[26]
Furthermore, ASM may contribute towards the apparent ‘missing heritability’ of many
complex diseases and the low penetrance often reported for SNPs identified by GWASs.[29]
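As an illustration of what 'levels of epigenome-wide significance' could look like in practice, the sketch below shows two generic multiple-testing corrections: a Bonferroni threshold and the Benjamini-Hochberg false discovery rate procedure. These are standard statistical tools, not guidelines proposed in this commentary. The round figure of 480 000 tests echoes the number of CG dinucleotides mentioned earlier, and the P-values are invented. Both approaches are shown in their simplest form, which does not account for the correlation between adjacent CGs.

```python
from typing import List


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Mark which tests are rejected at false discovery rate q (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose P-value lies below k * q / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            max_rank = rank
    # Reject every test ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(bonferroni_threshold(0.05, 480_000))
    # Invented P-values for ten hypothetical sites.
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, q=0.05))
```

Which correction is appropriate, and how correlated sites and region-level tests should be handled, is exactly the kind of question that shared guidelines would need to settle.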
We have to manage high expectations
There is a considerable interest in epigenetic research in the popular press. The
current study is a vivid illustration: even though the authors deem it preliminary,
it was widely covered by the media.[30]
Epigenetics should avoid some of the hype that surrounded the early days of genetic
epidemiology. After the draft human genome sequence was announced in 2001, it was
widely perceived that we would soon understand the causes of most common diseases
and how to treat them. This expectation was not realistic, but it was not always disavowed
by geneticists. Currently, many scientists outside the field are disappointed by the results
of human genetics, and in particular GWASs, despite their overall considerable success.
Genetic epidemiology has proven to be harder than expected despite the favourable
starting point of thousands of Mendelian diseases and the high heritabilities associated
with most traits to be explained. Very much like genetics, epigenetics will not be
able to deliver the miracles that are sometimes claimed for it.
In conclusion, epigenetic epidemiology is early in its development and open
to new ideas and approaches. Only a few years ago empirical papers were greatly outnumbered
by reviews. Now, reference epigenomes are being produced at a great pace (see http://epigenomeatlas.org).[8,9]
Moreover, furthered by pilot studies like the one from Nada Borghol et al.,[4]
the outline of the infrastructure required for EWASs is emerging. Crucial elements
include optimal study designs, benchmarking technology and data analysis approaches
that are statistically and biologically sound. An additional key aspect to the successful
design and interpretation of epigenetic epidemiological studies will be the creation
of public genome-scale resources focusing on inter-individual variation incorporating
epigenomic, DNA sequence and transcriptomic data. Education, hard work and a certain
degree of luck will get us there—not very different to the remedy against low SES.
Funding
NGI/NWO (#93518027, to B.T.H.); NGI/NWO-funded Netherlands Consortium for Healthy
Ageing (NCHA) (#05060810, B.T.H.); NIH grant (AG036039, to J.M.).