Analyses and Comparison of Accuracy of Different Genotype Imputation Methods

Pei, Yu-Fang; Li, Jian; Zhang, Lei; Papasian, Christopher J.; Deng, Hong-Wen

doi:10.1371/journal.pone.0003551

Author(s): Yu-Fang Pei [1,2], Jian Li [2], Lei Zhang [1,2], Christopher J. Papasian [2], Hong-Wen Deng [1,2,3,*]

Editor(s): Peter Heutink

Publication date (Electronic): 29 October 2008

Journal: PLoS ONE

Publisher: Public Library of Science

Abstract

The power of genetic association analyses is often compromised by missing genotypic data, which contribute to a lack of significant findings, e.g., in in silico replication studies. One solution is to impute untyped SNPs from typed flanking markers, based on known linkage disequilibrium (LD) relationships. Several imputation methods are available and their usefulness in association studies has been demonstrated, but the factors affecting their relative accuracy have not been systematically investigated. We therefore compared five popular genotype imputation methods, MACH, IMPUTE, fastPHASE, PLINK and Beagle, to assess the factors that affect imputation accuracy rates (ARs). Our results showed that stronger LD and a lower minor allele frequency (MAF) for an untyped marker produced better ARs for all five methods. We also observed that a greater number of haplotypes in the reference sample resulted in higher ARs for MACH, IMPUTE, PLINK and Beagle, but had little influence on the ARs for fastPHASE. In general, MACH and IMPUTE produced similar results, and these two methods consistently outperformed fastPHASE, PLINK and Beagle. Our study should help guide the application of imputation methods in association analyses when genotype data are missing.
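The three quantities the abstract relates (MAF, LD strength, and accuracy rate) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names are hypothetical, LD is measured with the standard r² statistic on phased 0/1 haplotypes, and the AR is assumed here to be the fraction of imputed genotypes that match the true (masked) genotypes, which may differ from the paper's exact definition.

```python
from typing import Sequence


def minor_allele_frequency(genotypes: Sequence[int]) -> float:
    """MAF from genotypes coded as 0/1/2 copies of one allele (assumed coding)."""
    freq = sum(genotypes) / (2 * len(genotypes))
    return min(freq, 1.0 - freq)


def ld_r_squared(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Standard r^2 between two biallelic loci from phased haplotypes coded 0/1."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying allele 1 at both loci.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # a monomorphic locus carries no LD information
    return d * d / denom


def accuracy_rate(true_genotypes: Sequence[int], imputed: Sequence[int]) -> float:
    """Assumed AR definition: share of imputed genotypes equal to the truth."""
    matches = sum(1 for t, i in zip(true_genotypes, imputed) if t == i)
    return matches / len(true_genotypes)


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    print(minor_allele_frequency([0, 1, 2, 0]))
    print(ld_r_squared([0, 1, 0, 1], [0, 1, 0, 1]))
    print(accuracy_rate([0, 1, 2, 1], [0, 1, 1, 1]))
```

In an evaluation of this kind, one would typically mask known genotypes at a marker, impute them with each method, and then compare `accuracy_rate` across bins of `ld_r_squared` and `minor_allele_frequency`.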

Most cited references 27

A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase.

Paul Scheet, Matthew Stephens (2006)

We present a statistical model for patterns of genetic variation in samples of unrelated individuals from natural populations. This model is based on the idea that, over short regions, haplotypes in a population tend to cluster into groups of similar haplotypes. To capture the fact that, because of recombination, this clustering tends to be local in nature, our model allows cluster memberships to change continuously along the chromosome according to a hidden Markov model. This approach is flexible, allowing for both "block-like" patterns of linkage disequilibrium (LD) and gradual decline in LD with distance. The resulting model is also fast and, as a result, is practicable for large data sets (e.g., thousands of individuals typed at hundreds of thousands of markers). We illustrate the utility of the model by applying it to dense single-nucleotide-polymorphism genotype data for the tasks of imputing missing genotypes and estimating haplotypic phase. For imputing missing genotypes, methods based on this model are as accurate or more accurate than existing methods. For haplotype estimation, the point estimates are slightly less accurate than those from the best existing methods (e.g., for unrelated Centre d'Etude du Polymorphisme Humain individuals from the HapMap project, switch error was 0.055 for our method vs. 0.051 for PHASE) but require a small fraction of the computational cost. In addition, we demonstrate that the model accurately reflects uncertainty in its estimates, in that probabilities computed using the model are approximately well calibrated. The methods described in this article are implemented in a software package, fastPHASE, which is available from the Stephens Lab Web site.
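The switch error quoted in this abstract can be illustrated with a short Python sketch. It uses the common definition (the proportion of consecutive heterozygous sites at which the inferred phase flips relative to the truth) and is not taken from the fastPHASE software. The function name and the 0/1 haplotype coding are assumptions, and the inferred haplotypes are assumed to be consistent with the true genotypes.

```python
from typing import Sequence


def switch_error_rate(
    true_h1: Sequence[int],
    true_h2: Sequence[int],
    inferred_h1: Sequence[int],
) -> float:
    """Fraction of adjacent heterozygous-site pairs whose phase orientation flips.

    Only one inferred haplotype is needed: at a heterozygous site the second
    inferred haplotype is its complement (assuming genotype consistency).
    """
    # Phase is only defined at sites where the two true haplotypes differ.
    het_sites = [i for i in range(len(true_h1)) if true_h1[i] != true_h2[i]]
    if len(het_sites) < 2:
        return 0.0  # no adjacent pair of heterozygous sites to compare
    # Orientation: does the inferred haplotype follow true_h1 at this site?
    orientation = [inferred_h1[i] == true_h1[i] for i in het_sites]
    switches = sum(
        1 for prev, cur in zip(orientation, orientation[1:]) if prev != cur
    )
    return switches / (len(het_sites) - 1)


if __name__ == "__main__":
    # Toy example: the inferred haplotype switches phase once, between sites 2 and 3.
    print(switch_error_rate([0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 1]))
```

Note that an inferred haplotype that is the exact complement of `true_h1` yields a rate of zero, because the labelling of the two haplotypes is arbitrary and only changes in orientation count as errors.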

Newly identified loci that influence lipid concentrations and risk of coronary artery disease.

Cristen J Willer, Serena Sanna, Anne U. Jackson … (2008)

To identify genetic variants influencing plasma lipid concentrations, we first used genotype imputation and meta-analysis to combine three genome-wide scans totaling 8,816 individuals and comprising 6,068 individuals specific to our study (1,874 individuals from the FUSION study of type 2 diabetes and 4,184 individuals from the SardiNIA study of aging-associated variables) and 2,758 individuals from the Diabetes Genetics Initiative, reported in a companion study in this issue. We subsequently examined promising signals in 11,569 additional individuals. Overall, we identify strongly associated variants in eleven loci previously implicated in lipid metabolism (ABCA1, the APOA5-APOA4-APOC3-APOA1 and APOE-APOC clusters, APOB, CETP, GCKR, LDLR, LPL, LIPC, LIPG and PCSK9) and also in several newly identified loci (near MVK-MMAB and GALNT2, with variants primarily associated with high-density lipoprotein (HDL) cholesterol; near SORT1, with variants primarily associated with low-density lipoprotein (LDL) cholesterol; near TRIB1, MLXIPL and ANGPTL3, with variants primarily associated with triglycerides; and a locus encompassing several genes near NCAN, with variants strongly associated with both triglycerides and LDL cholesterol). Notably, the 11 independent variants associated with increased LDL cholesterol concentrations in our study also showed increased frequency in a sample of coronary artery disease cases versus controls.

Efficiency and power in genetic association studies.

Paul de Bakker, Roman Yelensky, Itsik Pe'er … (2005)

We investigated selection and analysis of tag SNPs for genome-wide association studies by specifically examining the relationship between investment in genotyping and statistical power. Do pairwise or multimarker methods maximize efficiency and power? To what extent is power compromised when tags are selected from an incomplete resource such as HapMap? We addressed these questions using genotype data from the HapMap ENCODE project, association studies simulated under a realistic disease model, and empirical correction for multiple hypothesis testing. We demonstrate a haplotype-based tagging method that uniformly outperforms single-marker tests and methods for prioritization that markedly increase tagging efficiency. Examining all observed haplotypes for association, rather than just those that are proxies for known SNPs, increases power to detect rare causal alleles, at the cost of reduced power to detect common causal alleles. Power is robust to the completeness of the reference panel from which tags are selected. These findings have implications for prioritizing tag SNPs and interpreting association studies.

Author and article information

Contributors

Peter Heutink: Role: Editor

Journal

Journal ID (nlm-ta): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, USA)

ISSN (Electronic): 1932-6203

Publication date Collection: 2008

Publication date (Electronic): 29 October 2008

Volume: 3

Issue: 10

Electronic Location Identifier: e3551

Affiliations

[1] Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an, Shaanxi, People's Republic of China

[2] School of Medicine, University of Missouri-Kansas City, Kansas City, Missouri, United States of America

[3] Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan, People's Republic of China

Editor's affiliation: Vrije Universiteit Medical Centre, Netherlands

Author notes

* E-mail: hwdeng@mail.xjtu.edu.cn

Conceived and designed the experiments: YFP HWD. Performed the experiments: YFP. Analyzed the data: YFP LZ. Contributed reagents/materials/analysis tools: JL. Wrote the paper: YFP. Helped revise the paper: HWD JL LZ CP.

Article

Publisher ID: 08-PONE-RA-05131R1

DOI: 10.1371/journal.pone.0003551

PMC ID: 2569208

PubMed ID: 18958166

SO-VID: 862e3982-6042-40c8-8a38-d17540c663c2

Copyright © Pei et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 17 June 2008

Date accepted: 30 September 2008

Page count

Pages: 7
