      Meta-imputation of transcriptome from genotypes across multiple datasets by leveraging publicly available summary-level data

      PLoS Genetics
      Public Library of Science

          Abstract

          Transcriptome-wide association studies (TWAS) are a powerful method for identifying and interpreting the biological mechanisms underlying GWAS findings by associating gene expression levels with phenotypes. In TWAS, gene expression is often imputed from individual-level genotypes of regulatory variants identified from external resources, such as the Genotype-Tissue Expression (GTEx) Project. In this setting, a straightforward approach to imputing expression levels of a specific tissue is to use the model trained on the same tissue type. When multiple tissues are available for the same subjects, it has been demonstrated that training imputation models on multiple tissue types improves accuracy because of shared eQTLs between the tissues and an increase in effective sample size. However, existing joint-tissue methods require access to genotype and expression data across all tissues. Moreover, they cannot leverage the abundance of expression datasets across various tissues for non-overlapping individuals. Here, we explore the optimal way to combine imputed levels across training models from multiple tissues and datasets in a flexible manner using summary-level data. Our proposed method (SWAM) combines an arbitrary number of transcriptome imputation models to linearly optimize the imputation accuracy given a target tissue. By integrating models across tissues and/or individuals, SWAM can improve the accuracy of transcriptome imputation or the power of TWAS while requiring individual-level data from only a single reference cohort. To evaluate the accuracy of SWAM, we combined 49 tissue-specific gene expression imputation models from the GTEx Project as well as from a large eQTL study, the Depression Susceptibility Genes and Networks (DGN) Project, and tested imputation accuracy in GEUVADIS lymphoblastoid cell line samples. We also extend our meta-imputation method to meta-TWAS to leverage multiple tissues in TWAS analysis with summary-level statistics. Our results underscore the importance of integrating multiple tissues to unravel regulatory impacts of genetic variants on complex traits.
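
The idea of linearly combining several imputation models for one target tissue can be illustrated with a small sketch. This is not the SWAM implementation (the authors' software is linked from this record); it is a minimal stand-in that assumes the weights are fitted by ridge-regularized least squares in a single reference cohort where measured expression is available. The function names `fit_meta_weights` and `meta_impute`, the `ridge` parameter, and the simulated data are all hypothetical.

```python
import numpy as np


def fit_meta_weights(pred_ref, expr_ref, ridge=1e-3):
    """Fit one weight per imputation model (hypothetical helper, not SWAM's API).

    pred_ref: (n_samples, n_models) expression imputed by each model
              in the reference cohort, for a single gene.
    expr_ref: (n_samples,) measured expression in the target tissue.
    ridge:    small penalty that keeps the linear solve stable when
              model predictions are highly correlated.
    """
    pred_ref = np.asarray(pred_ref, dtype=float)
    expr_ref = np.asarray(expr_ref, dtype=float)
    n_models = pred_ref.shape[1]
    gram = pred_ref.T @ pred_ref + ridge * np.eye(n_models)
    return np.linalg.solve(gram, pred_ref.T @ expr_ref)


def meta_impute(pred_new, weights):
    """Weighted average of per-model imputed expression in a new cohort."""
    return np.asarray(pred_new, dtype=float) @ weights


# Simulated example: three models that track the same genetic signal
# with different amounts of noise (assumed values, for illustration only).
rng = np.random.default_rng(0)
n_ref, n_new = 200, 50
noise_sd = np.array([0.5, 1.0, 3.0])

signal_ref = rng.normal(size=n_ref)
signal_new = rng.normal(size=n_new)
pred_ref = signal_ref[:, None] + rng.normal(size=(n_ref, 3)) * noise_sd
pred_new = signal_new[:, None] + rng.normal(size=(n_new, 3)) * noise_sd
expr_ref = signal_ref + rng.normal(scale=0.3, size=n_ref)

weights = fit_meta_weights(pred_ref, expr_ref)
combined = meta_impute(pred_new, weights)
print("weights per model:", np.round(weights, 3))
```

In this toy setting only the reference cohort contributes individual-level expression; the other cohort is touched only through model predictions, which mirrors the data-access pattern described in the abstract. The weighting scheme SWAM actually uses may differ from plain least squares.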

          Author summary

          The gene expression levels within a cell are affected by various factors, including DNA variation, cell type, cellular microenvironment, disease status, and other environmental factors surrounding the individual. The genetic component of gene expression is known to explain a substantial fraction of transcriptional variation among individuals and can be imputed from genotypes in a tissue-specific manner, by training on population-scale transcriptomic profiles designed to identify expression quantitative trait loci (eQTLs). Imputing gene expression levels has been shown to help in understanding the genetic basis of human disease through transcriptome-wide association analysis (TWAS) and Mendelian randomization (MR). However, it has been unclear how to integrate multiple imputation models trained on individual datasets to maximize their accuracy without having to access individual genotypes and expression levels, which are often protected because of privacy concerns. We developed SWAM (Smartly Weighted Averaging across Multiple datasets), a meta-imputation framework that can accurately impute gene expression levels from genotypes by integrating multiple imputation models without requiring individual-level data. Our method examines the similarities and differences between resources and borrows the information most relevant to the tissue of interest. We demonstrate that SWAM outperforms existing single-tissue and multi-tissue imputation models and continues to increase in accuracy when additional imputation models are integrated.

                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Supervision, Validation, Visualization, Writing – review & editing
                Role: Editor
                Journal
                PLoS Genetics (PLoS Genet)
                Public Library of Science (San Francisco, CA USA)
                ISSN: 1553-7390, 1553-7404
                31 January 2022
                Volume 18, Issue 1, e1009571
                Affiliations
                [001] Department of Biostatistics and Center for Statistical Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
                University of Pennsylvania, UNITED STATES
                Author notes

                H.M.K. is presently an employee of Regeneron Pharmaceuticals, in which he owns stock and stock options. A.E.L. is presently an employee of Gencove Inc., in which he owns stock and stock options. Regeneron Pharmaceuticals and Gencove Inc. did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. There are no patents, products in development or marketed products associated with this research to declare. This does not alter our adherence to PLOS policies on sharing data and materials.

                Author information
                https://orcid.org/0000-0001-5522-1263
                https://orcid.org/0000-0002-3631-3979
                Article
                Manuscript ID: PGENETICS-D-21-00570
                DOI: 10.1371/journal.pgen.1009571
                PMCID: PMC8830793
                PMID: 35100255
                © 2022 Liu, Kang

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 26 April 2021
                Accepted: 7 January 2022
                Page count
                Figures: 4, Tables: 0, Pages: 22
                Funding
                Funded by: National Heart, Lung, and Blood Institute (funder-id http://dx.doi.org/10.13039/100000050); Award ID: HL137182
                Funded by: National Human Genome Research Institute (funder-id http://dx.doi.org/10.13039/100000051); Award ID: HG009976
                Funded by: National Institute of Diabetes and Digestive and Kidney Diseases (funder-id http://dx.doi.org/10.13039/100000062); Award ID: DK082841
                Funded by: National Institute of Diabetes and Digestive and Kidney Diseases (funder-id http://dx.doi.org/10.13039/100000062); Award ID: DK081943
                This work was supported by NIH grants HL137182 (from NHLBI, https://www.nhlbi.nih.gov/, to A.E.L. and H.M.K.), HG009976 (from NHGRI, https://www.genome.gov/, to H.M.K.), DK082841 (from NIDDK, https://www.niddk.nih.gov/, to A.E.L. and H.M.K.), and DK081943 (from NIDDK, https://www.niddk.nih.gov/, to A.E.L. and H.M.K.). The authors receiving funding were A.E.L. and H.M.K. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Computational Biology > Genome Analysis > Transcriptome Analysis
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Transcriptome Analysis
                Biology and Life Sciences > Computational Biology > Genome Analysis > Genome-Wide Association Studies
                Biology and Life Sciences > Genetics > Genomics > Genome Analysis > Genome-Wide Association Studies
                Biology and Life Sciences > Genetics > Human Genetics > Genome-Wide Association Studies
                Biology and Life Sciences > Physiology > Physiological Parameters > Body Weight
                Biology and Life Sciences > Genetics > Gene Expression
                Biology and Life Sciences > Genetics > Heredity
                Biology and Life Sciences > Genetics
                Social Sciences > Sociology > Consortia
                Biology and Life Sciences > Genetics > Gene Expression > Gene Regulation
                Custom metadata
                vor-update-to-uncorrected-proof
                2022-02-10
                Software and raw data files are held at https://github.com/aeyliu/SWAM; additional scripts can be found at https://github.com/aeyliu/SWAM-manuscript.
