Automated medical chart review for breast cancer outcomes research: a novel natural language processing extraction system


Abstract

Background

Data points manually extracted from health records are collated at the institutional, provincial, and national levels to facilitate clinical research. However, the labour-intensive clinical chart review process places a growing burden on healthcare system budgets. An automated information extraction system is therefore needed to ensure the timeliness and scalability of research data.

Methods

We used a dataset of 100 synoptic operative reports and 100 pathology reports, split evenly into training and test sets of 50 reports for each report type. The training set guided our development of a Natural Language Processing (NLP) extraction pipeline, which accepts scanned images of operative and pathology reports. The system combines rule-based and transfer learning methods to extract numeric encodings from text. We also developed visualization tools to compare the manual and automated extractions. The code for this paper is available on GitHub.
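As a rough illustration of what a rule-based step that maps report text to numeric encodings might look like, the following is a minimal sketch. It is not the authors' implementation: the field names, the regular expressions, the report wording, and the numeric codes are all hypothetical, and the sketch assumes the report has already been converted from a scanned image to plain text.

```python
import re

# Hypothetical codebook: these codes are illustrative assumptions,
# not the study's actual encodings.
INVASIVE_CODES = {"present": 1, "absent": 0}

# Hypothetical patterns for a synoptic-style "Field: value" layout.
INVASIVE_PATTERN = re.compile(
    r"invasive carcinoma\s*:\s*(present|absent)", re.IGNORECASE
)
SIZE_PATTERN = re.compile(
    r"size\s*:\s*(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE
)


def extract_fields(report_text):
    """Return encoded values; None marks a field the rules did not find."""
    encoded = {"invasive_carcinoma": None, "tumor_size_mm": None}

    match = INVASIVE_PATTERN.search(report_text)
    if match:
        encoded["invasive_carcinoma"] = INVASIVE_CODES[match.group(1).lower()]

    match = SIZE_PATTERN.search(report_text)
    if match:
        value = float(match.group(1))
        unit = match.group(2).lower()
        # Normalise every size to millimetres so the encoding is comparable.
        encoded["tumor_size_mm"] = value * 10 if unit == "cm" else value

    return encoded


sample = "Invasive carcinoma: Present\nTumor size: 1.5 cm"
print(extract_fields(sample))
```

Returning `None` for unmatched fields keeps "not found" distinct from a legitimate code of 0, which matters when automated output is later compared against manual extraction.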

Results

A test set of 50 operative and 50 pathology reports was used to evaluate the extraction accuracy of the NLP pipeline. Gold standard, defined as manual extraction by expert reviewers, yielded accuracies of 90.5% for operative reports and 96.0% for pathology reports, while the NLP system achieved overall accuracies of 91.9% (operative) and 95.4% (pathology). The pipeline successfully extracted outcomes data pertinent to breast cancer tumor characteristics (e.g. presence of invasive carcinoma, size, histologic type), prognostic factors (e.g. number of lymph nodes with micro-metastases and macro-metastases, pathologic stage), and treatment-related variables (e.g. margins, neo-adjuvant treatment, surgical indication) with high accuracy. Of the 48 variables across the operative and pathology codebooks, the NLP system yielded 43 variables with F-scores of at least 0.90; in comparison, a trained human annotator yielded 44 variables with F-scores of at least 0.90.
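To make the per-variable evaluation concrete, the sketch below computes accuracy and an F-score for a single variable by comparing extracted codes against gold-standard codes. The abstract does not state how the F-score was averaged, so the macro-averaged F1 used here is an assumption, and the function names and the toy data are hypothetical.

```python
def accuracy(gold, predicted):
    """Fraction of reports where the extracted code equals the gold code."""
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


def macro_f1(gold, predicted):
    """Unweighted mean of per-code F1 scores (assumed averaging scheme)."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); guard against an empty denominator.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: codes for one variable across four reports (made-up data).
gold_codes = [1, 1, 0, 0]
extracted_codes = [1, 0, 0, 0]
print(accuracy(gold_codes, extracted_codes))
print(macro_f1(gold_codes, extracted_codes))
```

Running such a function once per codebook variable, and counting how many variables reach 0.90 or higher, would give a tally of the kind reported above.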

Conclusions

The NLP system achieves near-human-level accuracy in both operative and pathology reports using a minimal curated dataset. This system uniquely provides a robust solution for transparent, adaptable, and scalable automation of data extraction from patient health records. It may serve to advance breast cancer clinical research by facilitating collection of vast amounts of valuable health data at a population level.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12874-022-01583-z.


Author and article information

Contributors

Kathryn V. Isaac: kathryn.isaac@ubc.ca

Journal

Journal ID (nlm-ta): BMC Med Res Methodol

Journal ID (iso-abbrev): BMC Med Res Methodol

Title: BMC Medical Research Methodology

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2288

Publication date (Electronic): 12 May 2022

Publication date PMC-release: 12 May 2022

Publication date Collection: 2022

Volume: 22

Electronic Location Identifier: 136

Affiliations

[1] GRID grid.17091.3e, ISNI 0000 0001 2288 9830, Department of Computer Science, University of British Columbia, Faculty of Science, 201-2366 Main Mall, Vancouver, BC V6T 1Z4, Canada

[2] GRID grid.460559.b, Prevention of Organ Failure (PROOF) Centre of Excellence, 1190 Hornby Street, Vancouver, BC V6Z 2K5, Canada

[3] GRID grid.17091.3e, ISNI 0000 0001 2288 9830, Department of Surgery, University of British Columbia, Faculty of Medicine, 2221 Wesbrook Mall, Vancouver, BC V5Z 1M9, Canada

Article

Publisher ID: 1583

DOI: 10.1186/s12874-022-01583-z

PMC ID: 9101856

PubMed ID: 35549854

SO-VID: f488101e-ce42-4308-bcd7-c0ea83888f24

License:

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

History

Date received : 11 October 2021

Date accepted : 15 March 2022

Custom metadata

ScienceOpen disciplines: Medicine

Keywords: natural language processing, breast cancer, health data

