TCGA Expedition: A Data Acquisition and Management System for TCGA Data

Abstract

Background

The Cancer Genome Atlas Project (TCGA) is a National Cancer Institute effort to profile at least 500 cases of 20 different tumor types using genomic platforms and to make these data, both raw and processed, available to all researchers. TCGA data are currently over 1.2 petabytes in size and include whole genome sequence (WGS), whole exome sequence, methylation, RNA expression, proteomic, and clinical datasets. Publicly accessible TCGA data are released through public portals, but many challenges exist in navigating and using data obtained from these sites. We developed TCGA Expedition to support the research community focused on computational methods for cancer research. Data obtained, versioned, and archived using TCGA Expedition support command-line access at high-performance computing facilities as well as some functionality with third-party tools. For a subset of TCGA data collected at the University of Pittsburgh, we also re-associate TCGA data with de-identified data from the electronic health records. Here we describe the software as well as the architecture of our repository, methods for loading TCGA data to multiple platforms, and security and regulatory controls that conform to federal best practices.

Results

TCGA Expedition software consists of a set of scripts written in Bash, Python, and Java that download, extract, harmonize, version, and store all TCGA data and metadata. The software generates a versioned, participant- and sample-centered, local TCGA data directory with metadata structures that directly reference the local data files as well as the original data files. The software supports flexible searches of the data via a web portal, user-centric data tracking tools, and data provenance tools. Using this software, we created a collaborative repository, the Pittsburgh Genome Resource Repository (PGRR), that enabled investigators at our institution to work with all TCGA data formats and to interrogate these data with analysis pipelines and associated tools. WGS data are especially challenging for individual investigators to use because of issues with downloading, storage, and processing; having locally accessible WGS BAM files has proven invaluable.
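
To make the idea of a versioned, participant- and sample-centered directory more concrete, the following is a minimal Python sketch, not the TCGA Expedition implementation. It relies only on the public TCGA barcode convention, in which the first three hyphen-separated fields identify the participant and the fourth adds the sample. The directory layout, the function names (parse_barcode, local_path, metadata_record), the metadata field names, and the example URL are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import PurePosixPath


def parse_barcode(barcode: str) -> dict:
    """Derive participant and sample IDs from a TCGA barcode.

    Fields are hyphen-separated: project, tissue source site,
    participant, then sample type plus vial, and so on.
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    return {
        "participant": "-".join(parts[:3]),
        "sample": "-".join(parts[:4]),
    }


def local_path(root, disease, data_type, barcode, version, filename):
    """Build a hypothetical participant/sample-centered, versioned path."""
    ids = parse_barcode(barcode)
    return PurePosixPath(
        root, disease, ids["participant"], ids["sample"],
        data_type, f"v{version}", filename,
    )


def metadata_record(original_url, content, path, version):
    """Link a local file to its original source (hypothetical schema)."""
    return {
        "local_path": str(path),
        "original_url": original_url,
        "version": version,
        # A checksum lets a later download be compared with this version.
        "sha256": hashlib.sha256(content).hexdigest(),
    }


# Illustrative values only; none of these come from the article.
path = local_path("/pgrr_demo", "gbm", "rnaseq",
                  "TCGA-02-0001-01C-01D-0182-01", 1, "expr.txt")
record = metadata_record("https://example.org/original/expr.txt",
                         b"gene\tvalue\n", path, 1)
print(json.dumps(record, indent=2))
```

In this sketch a changed checksum on re-download would be stored under a new version directory, so earlier analyses can still point at the exact files they used. How TCGA Expedition itself detects and records versions is described in its user guide, not here.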

Conclusion

Our open-source, freely available TCGA Expedition software can be used to create a local collaborative infrastructure for acquiring, managing, and analyzing TCGA data and other large public datasets.

Author and article information

Contributors

Esmaeil Ebrahimie: Role: Editor

Journal

Journal ID (nlm-ta): PLoS One

Journal ID (iso-abbrev): PLoS ONE

Journal ID (publisher-id): plos

Journal ID (pmc): plosone

Title: PLoS ONE

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Electronic): 1932-6203

Publication date (Electronic): 27 October 2016

Publication date Collection: 2016

Volume: 11

Issue: 10

Electronic Location Identifier: e0165395

Affiliations

[1] Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, United States of America

[2] University of Pittsburgh Cancer Institute, Pittsburgh, PA, United States of America

[3] Department of Human Genetics, University of Pittsburgh School of Public Health, Pittsburgh, PA, United States of America

[4] Center for Simulation and Modeling, University of Pittsburgh, Pittsburgh, PA, United States of America

[5] Pittsburgh Supercomputing Center, Carnegie Mellon University, Pittsburgh, PA, United States of America

[6] Department of Pharmacology and Cell Biology, University of Pittsburgh, Pittsburgh, PA, United States of America

[7] Magee-Women’s Research Institute, Pittsburgh, PA, United States of America

[8] UPMC Corporate Services, Pittsburgh, PA, United States of America

[9] Institute for Precision Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America

Flinders University, AUSTRALIA

Author notes

Competing Interests: The authors have declared that no competing interests exist.

Conceptualization: URC MMB JMB RSJ.
Data curation: OPM PDB.
Formal analysis: URC OPM PDB AC SL AVL.
Funding acquisition: JMB RSJ.
Methodology: MMB RSJ.
Project administration: RSJ.
Resources: PDB AF KFW ZZ RB JRS.
Software: OPM PDB AC SL AF KFW ZZ RB JRS RSJ.
Supervision: RSJ.
Validation: OPM PDB AC SL AF KFW ZZ RB JRS RSJ.
Visualization: OPM RSJ.
Writing – original draft: URC MMB RDB RSJ OPM.
Writing – review & editing: URV OPM MMD PDB AB RSJ.

* E-mail: rebeccaj@pitt.edu

Article

Publisher ID: PONE-D-16-11187

DOI: 10.1371/journal.pone.0165395

PMC ID: 5082933

PubMed ID: 27788220

SO-VID: 49c6461c-8f9a-4374-ae43-077c471c06eb

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 17 March 2016

Date accepted: 11 October 2016

Page count

Figures: 3, Tables: 1, Pages: 14

Funding

Funded by: National Science Foundation (funder ID: http://dx.doi.org/10.13039/100000001)

Award ID: 144064

Funded by: National Cancer Institute (funder ID: http://dx.doi.org/10.13039/100000054)

Award ID: P30CA047904

Funded by: University of Pittsburgh Institute for Personalized Medicine

Award Recipient: Jeremy M. Berg

We gratefully acknowledge support from the Institute for Precision Medicine at the University of Pittsburgh and the University of Pittsburgh Cancer Institute. The project used the UPCI Tissue and Research Pathology Services, which is supported in part by award P30CA047904 from the National Cancer Institute (www.cancer.gov). Upgrades to the Pitt networking infrastructure to support the collaboration were funded through National Science Foundation CC*IIE award #144064. This work used the Data Exacell, which is supported by National Science Foundation award number ACI-1261721, at the Pittsburgh Supercomputing Center (PSC). Additionally, this project used the UPCI Cancer Bioinformatics Services, which is supported in part by the National Cancer Institute award P30CA047904. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

Data Availability: Source code and documentation needed to replicate this study are available from GitHub at the following two links: https://github.com/TCGAExpedition/tcga-expedition/blob/master/TCGA-Expedition.User.Guide.docx and http://github.com/TCGAExpedition. Other relevant links: project home page, https://www.ipm.pitt.edu/cancer-genome-atlas-project; training movie, https://www.youtube.com/watch?v=bpcQiBNf8Fc
