Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records


Abstract

Secondary use of electronic health records (EHRs) promises to advance clinical research and better inform clinical decision making. Challenges in summarizing and representing patient data have prevented the widespread use of predictive modeling with EHRs. Here we present a novel unsupervised deep feature learning method to derive a general-purpose patient representation from EHR data that facilitates clinical predictive modeling. In particular, a three-layer stack of denoising autoencoders was used to capture hierarchical regularities and dependencies in the aggregated EHRs of about 700,000 patients from the Mount Sinai data warehouse. The result is a representation we name “deep patient”. We evaluated whether this representation is broadly predictive of health states by assessing the probability that patients would develop various diseases. The evaluation used 76,214 test patients and covered 78 diseases from diverse clinical domains and temporal windows. Our results significantly outperformed those achieved using representations based on raw EHR data and alternative feature learning strategies. Predictions for severe diabetes, schizophrenia, and various cancers were among the best performing. These findings indicate that deep learning applied to EHRs can derive patient representations that offer improved clinical predictions, and could provide a machine learning framework for augmenting clinical decision systems.
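To make the idea of a three-layer stack of denoising autoencoders more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract states only the number of layers; everything else here is an assumption chosen for illustration: sigmoid units, tied encoder/decoder weights, masking noise, a cross-entropy reconstruction loss, full-batch gradient descent, the layer sizes, and the synthetic binary data standing in for aggregated EHR features. The function names (`train_denoising_autoencoder`, `fit_stack`, `encode`) are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_denoising_autoencoder(X, n_hidden, noise=0.05, lr=0.1, epochs=50, seed=0):
    """Train one tied-weight denoising autoencoder (illustrative assumptions).

    X must hold values in [0, 1]. Returns the encoder weights and hidden bias.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_visible = X.shape
    W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):
        # Masking noise: zero out a random fraction of the input entries.
        mask = rng.random(X.shape) >= noise
        X_noisy = X * mask
        H = sigmoid(X_noisy @ W + b_h)      # encode the corrupted input
        X_hat = sigmoid(H @ W.T + b_v)      # reconstruct the clean input
        # Cross-entropy loss with a sigmoid output layer: the gradient with
        # respect to the output pre-activation is (X_hat - X).
        d_out = (X_hat - X) / n_samples
        d_hid = (d_out @ W) * H * (1.0 - H)
        # Tied weights: W receives gradient from both encoder and decoder.
        grad_W = X_noisy.T @ d_hid + d_out.T @ H
        W -= lr * grad_W
        b_h -= lr * d_hid.sum(axis=0)
        b_v -= lr * d_out.sum(axis=0)
    return W, b_h


def fit_stack(X, layer_sizes, **kwargs):
    """Greedy layer-wise training: each layer is fit on the previous layer's codes."""
    layers, layer_input = [], X
    for n_hidden in layer_sizes:
        W, b_h = train_denoising_autoencoder(layer_input, n_hidden, **kwargs)
        layers.append((W, b_h))
        layer_input = sigmoid(layer_input @ W + b_h)  # clean encoding feeds the next layer
    return layers


def encode(X, layers):
    """Map raw feature vectors to the top-layer representation."""
    out = X
    for W, b_h in layers:
        out = sigmoid(out @ W + b_h)
    return out


# Synthetic stand-in for patient feature vectors (200 patients, 30 binary features).
X = (np.random.default_rng(42).random((200, 30)) < 0.2).astype(float)
layers = fit_stack(X, layer_sizes=[16, 16, 16])   # three layers; sizes are arbitrary
representation = encode(X, layers)
print(representation.shape)
```

In a setup like this, the top-layer output would play the role of the learned patient representation, and a separate supervised model could then be trained on it for disease prediction. That downstream step is not shown here.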


Author and article information

Journal

Journal ID (nlm-ta): Sci Rep

Journal ID (iso-abbrev): Sci Rep

Title: Scientific Reports

Publisher: Nature Publishing Group

ISSN (Electronic): 2045-2322

Publication date (Electronic): 17 May 2016

Publication date Collection: 2016

Volume: 6

Electronic Location Identifier: 26094

Affiliations

[1] Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA

[2] Harris Center for Precision Wellness, Icahn School of Medicine at Mount Sinai, New York, NY, USA

[3] Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, USA

Author notes

[a] joel.dudley@mssm.edu

Article

Publisher Item ID: srep26094

DOI: 10.1038/srep26094

PMC ID: 4869115

PubMed ID: 27185194

SO-VID: 3a198b76-79c1-4c05-825d-79941a50cf32

License:

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

History

Date received: 28 January 2016

Date accepted: 27 April 2016
