Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

Abstract

Motivation

Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses, making some software obsolete and limiting researchers' access to diverse analysis tools.

Results

Here we present two R packages, bigstatsr and bigsnpr, which allow the analysis of large-scale genomic data to be performed within R. To address large data size, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the commonly used tools, either through transparent system calls to existing software, or through updated or improved implementations of existing methods. In particular, the packages implement fast and accurate computations of principal component analysis and association studies, functions to remove single nucleotide polymorphisms in linkage disequilibrium, and algorithms to learn polygenic risk scores on millions of single nucleotide polymorphisms. We illustrate applications of the two R packages by analyzing a case–control genomic dataset for celiac disease, performing an association study and computing polygenic risk scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500 000 individuals and 1 million markers on a single desktop computer.
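
To make the memory-mapping idea concrete, the following is a minimal sketch in Python using only the standard library. It is not the packages' implementation: the one-byte-per-genotype, column-major file layout, the toy data pattern, and the function names (write_genotypes, allele_frequency, column_frequencies) are all assumptions made for illustration. The point it shows is that a per-variant statistic can be computed by touching only one column of an on-disk matrix at a time, so the full matrix never has to sit in RAM.

```python
import mmap
import os
import tempfile


def write_genotypes(path, n_ind, n_snp):
    """Write a toy genotype matrix (values 0, 1, 2) to disk.

    Assumed layout: column-major, one byte per genotype, so all
    individuals for SNP j are contiguous in the file.
    """
    with open(path, "wb") as f:
        for j in range(n_snp):
            # Deterministic toy pattern, not real data.
            f.write(bytes((i + j) % 3 for i in range(n_ind)))


def allele_frequency(mm, n_ind, j):
    """Alternate-allele frequency of SNP j, reading only that column."""
    column = mm[j * n_ind:(j + 1) * n_ind]  # copies just this slice
    return sum(column) / (2 * n_ind)


def column_frequencies(path, n_ind, n_snp):
    """Map the file read-only and compute one frequency per SNP."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [allele_frequency(mm, n_ind, j) for j in range(n_snp)]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "genotypes.bin")
    write_genotypes(path, n_ind=1000, n_snp=50)
    freqs = column_frequencies(path, n_ind=1000, n_snp=50)
    print(len(freqs), "allele frequencies computed from the mapped file")
```

In this sketch the operating system pages in only the parts of the file that are actually sliced, which is why peak memory use depends on the size of one column rather than on the size of the whole matrix.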

Availability and implementation

https://privefl.github.io/bigstatsr/ and https://privefl.github.io/bigsnpr/.

Supplementary information

Supplementary data are available at Bioinformatics online.

Most cited references 16

Fast Principal-Component Analysis Reveals Convergent Evolution of ADH1B in Europe and East Asia.

Kevin J Galinsky, Gaurav Bhatia, Po-Ru Loh … (2016)

Searching for genetic variants with unusual differentiation between subpopulations is an established approach for identifying signals of natural selection. However, existing methods generally require discrete subpopulations. We introduce a method that infers selection using principal components (PCs) by identifying variants whose differentiation along top PCs is significantly greater than the null distribution of genetic drift. To enable the application of this method to large datasets, we developed the FastPCA software, which employs recent advances in random matrix theory to accurately approximate top PCs while reducing time and memory cost from quadratic to linear in the number of individuals, a computational improvement of many orders of magnitude. We apply FastPCA to a cohort of 54,734 European Americans, identifying 5 distinct subpopulations spanning the top 4 PCs. Using the PC-based test for natural selection, we replicate previously known selected loci and identify three new genome-wide significant signals of selection, including selection in Europeans at ADH1B. The coding variant rs1229984(∗)T has previously been associated to a decreased risk of alcoholism and shown to be under selection in East Asians; we show that it is a rare example of independent evolution on two continents. We also detect selection signals at IGFBP3 and IGH, which have also previously been associated to human disease.

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Gad Abraham, Michael Inouye (2014)

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data, for detecting population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years and PCA of large datasets has become a time consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which delivers identical accuracy in extracting the top principal components compared with existing tools, in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and on a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential as traditional approaches will not adequately scale. This approach will also help to scale other applications that leverage PCA or eigen-decomposition to substantially larger datasets.

Complex disease and phenotype mapping in the domestic dog

Jessica Hayward, Marta G P Castelhano, Kyle Oliveira … (2016)

The domestic dog is becoming an increasingly valuable model species in medical genetics, showing particular promise to advance our understanding of cancer and orthopaedic disease. Here we undertake the largest canine genome-wide association study to date, with a panel of over 4,200 dogs genotyped at 180,000 markers, to accelerate mapping efforts. For complex diseases, we identify loci significantly associated with hip dysplasia, elbow dysplasia, idiopathic epilepsy, lymphoma, mast cell tumour and granulomatous colitis; for morphological traits, we report three novel quantitative trait loci that influence body size and one that influences fur length and shedding. Using simulation studies, we show that modestly larger sample sizes and denser marker sets will be sufficient to identify most moderate- to large-effect complex disease loci. This proposed design will enable efficient mapping of canine complex diseases, most of which have human homologues, using far fewer samples than required in human studies.

Author and article information

Contributors

Associate Editor: Oliver Stegle

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date (Print): 15 August 2018

Publication date (Electronic): 30 March 2018

Publication date PMC-release: 30 March 2018

Volume: 34

Issue: 16

Pages: 2781-2787

Affiliations

[1] Laboratoire TIMC-IMAG, UMR 5525, CNRS, Université Grenoble Alpes, Grenoble, France

[2] Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, Paris, France

[3] Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA

Author notes

To whom correspondence should be addressed. Email: florian.prive@univ-grenoble-alpes.fr or michael.blum@univ-grenoble-alpes.fr

Article

Publisher ID: bty185

DOI: 10.1093/bioinformatics/bty185

PMC ID: 6084588

PubMed ID: 29617937

SO-VID: 0d5e0b1b-0fd0-4b67-ab3d-cdf07e9da09a

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

History

Date received: 06 October 2017

Date revision received: 02 February 2018

Date accepted: 29 March 2018

Page count

Pages: 7

Funding

Funded by: LabEx PERSYVAL-Lab

Award ID: ANR-11-LABX-0025-01

Funded by: Grenoble Alpes Data Institute

Funded by: French National Research Agency 10.13039/501100001665

Award ID: ANR-15-IDEX-02
