The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments

Abstract

The Genome Analysis Toolkit (GATK) is a popular set of programs for discovering and genotyping variants from next-generation sequencing data. The current GATK recommendation for RNA sequencing (RNA-seq) is to perform variant calling on individual samples, with the drawback that only variable positions are reported. Versions 3.0 and above of GATK make it possible to call DNA variants on cohorts of samples using the HaplotypeCaller algorithm in Genomic Variant Call Format (GVCF) mode. With this approach, variants are first called individually on each sample, generating one GVCF file per sample that lists genotype likelihoods and their genome annotations. In a second step, variants are called from the GVCF files through a joint genotyping analysis. This strategy is more flexible and less computationally demanding than the traditional joint discovery workflow. Using a GVCF workflow for mining SNPs in RNA-seq data provides substantial advantages, including the reporting of homozygous genotypes for the reference allele as well as missing data. Using RNA-seq data derived from primary macrophages isolated from 50 cows, we validated the GATK joint genotyping method for calling variants on RNA-seq data by comparing it to a so-called “per-sample” method. In addition, pairwise comparisons of the two methods were performed to evaluate their respective sensitivity, precision and accuracy using DNA genotypes from a companion study in which the same 50 cows were genotyped using either genotyping-by-sequencing or the Bovine SNP50 Beadchip (imputed to the Bovine high density). Results indicate that the two approaches are very close in their capacity to detect reference variants and that the joint genotyping method is more sensitive than the per-sample method. Given that the joint genotyping method is more flexible and technically easier, we recommend this approach for variant calling in RNA-seq experiments.
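The comparison against DNA genotypes described above can be illustrated with a minimal Python sketch. It is not the authors' evaluation code, and the abstract does not give the exact definitions used in the study. The sketch assumes that DNA genotypes serve as the truth set, that genotypes are encoded as alternate-allele counts (0, 1 or 2, with `None` for missing data), that a site counts as a variant when it carries at least one alternate allele, and that sites missing in either data set are skipped. The function name `compare_calls` and the example data are hypothetical.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]      # (chromosome, position); hypothetical encoding
Genotype = Optional[int]    # alternate-allele count 0, 1 or 2; None = missing


def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def compare_calls(rna: Dict[Site, Genotype],
                  dna: Dict[Site, Genotype]) -> Dict[str, float]:
    """Compare RNA-seq calls with DNA genotypes treated as the truth set.

    Assumed definitions (not taken from the article):
      sensitivity = TP / (TP + FN)
      precision   = TP / (TP + FP)
      accuracy    = (TP + TN) / (TP + TN + FP + FN)
    """
    tp = fp = fn = tn = 0
    for site, truth in dna.items():
        call = rna.get(site)
        # Skip sites with no usable genotype in either data set.
        if truth is None or call is None:
            continue
        truth_is_variant = truth > 0
        call_is_variant = call > 0
        if truth_is_variant and call_is_variant:
            tp += 1
        elif call_is_variant:
            fp += 1
        elif truth_is_variant:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "precision": _ratio(tp, tp + fp),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
    }


if __name__ == "__main__":
    # Toy data for one animal; the values are invented for illustration only.
    dna_genotypes = {("1", 100): 1, ("1", 200): 0, ("1", 300): 2,
                     ("1", 400): 0, ("2", 50): 1, ("2", 75): 0}
    rna_genotypes = {("1", 100): 1, ("1", 200): 0, ("1", 300): 0,
                     ("1", 400): 1, ("2", 50): 2, ("2", 75): None}
    for name, value in compare_calls(rna_genotypes, dna_genotypes).items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions, true negatives can only be counted when homozygous-reference genotypes are reported explicitly, which the abstract notes is the case for the joint genotyping workflow but not for a per-sample workflow that reports only variable positions.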

Electronic supplementary material

The online version of this article (10.1186/s40104-019-0359-0) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Jean-Simon Brouard: jean-simon.brouard@canada.ca

Flavio Schenkel: schenkel@uoguelph.ca

Andrew Marete: andrew.marete@canada.ca

Nathalie Bissonnette: nathalie.bissonnette@canada.ca

Journal

Journal ID (nlm-ta): J Anim Sci Biotechnol

Journal ID (iso-abbrev): J Anim Sci Biotechnol

Title: Journal of Animal Science and Biotechnology

Publisher: BioMed Central (London)

ISSN (Print): 1674-9782

ISSN (Electronic): 2049-1891

Publication date (Electronic): 21 June 2019

Publication date PMC-release: 21 June 2019

Publication date Collection: 2019

Volume: 10

Electronic Location Identifier: 44

Affiliations

[1] Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, QC J1M 0C8, Canada (ISNI 0000 0001 1302 4958; GRID grid.55614.33)

[2] Center of Genetic Improvement of Livestock, University of Guelph, Guelph, ON N1G 2W1, Canada (ISNI 0000 0004 1936 8198; GRID grid.34429.38)

Article

Publisher ID: 359

DOI: 10.1186/s40104-019-0359-0

PMC ID: 6587293

PubMed ID: 31249686

SO-VID: 20c00497-af24-4245-9c18-68bfd4a7c07b

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 26 December 2018

Date accepted: 28 April 2019

Funding

Funded by: Agriculture and Agri-Food Canada

Award ID: J000075

Award Recipient: Nathalie Bissonnette

Custom metadata

ScienceOpen disciplines: Animal science & Zoology

Keywords: gatk, gvcf, joint genotyping, rna-seq, snp
