tmChem: a high performance approach for chemical named entity recognition and normalization

Abstract

Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying the chemical mentions, their properties, and their relationships as discussed in the literature.

We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and 0.8745 f-measure on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high recall variant achieved the highest recall on both the CEM and CDI tasks.
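The micro-averaged f-measure quoted above pools true positives, false positives, and false negatives over all abstracts before computing precision and recall. The following is a minimal sketch of that calculation, not the official CHEMDNER evaluation script. It assumes exact matching of mention offsets, and the function name `micro_prf` and the data layout (a dictionary from document ID to a set of `(start, end)` spans) are illustrative choices.

```python
from typing import Dict, Set, Tuple

# A mention is a (start, end) character-offset pair; this layout is an
# assumption made for illustration, not the CHEMDNER file format.
Mention = Tuple[int, int]


def micro_prf(
    gold: Dict[str, Set[Mention]],
    pred: Dict[str, Set[Mention]],
) -> Tuple[float, float, float]:
    """Return micro-averaged (precision, recall, f-measure) over all documents."""
    tp = fp = fn = 0
    # Pool counts across every document that appears in either set.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted mentions that exactly match gold
        fp += len(p - g)   # predicted mentions absent from gold
        fn += len(g - p)   # gold mentions that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Toy data: two abstracts, three gold mentions, three predicted mentions.
gold = {"doc1": {(0, 5), (10, 15)}, "doc2": {(3, 8)}}
pred = {"doc1": {(0, 5)}, "doc2": {(3, 8), (20, 25)}}
precision, recall, f_measure = micro_prf(gold, pred)
```

An abstract-level evaluation such as the CDI subtask could reuse the same pooling with sets of unique mention strings per abstract in place of offset pairs; that adaptation is likewise an assumption, not a description of the task's scorer.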

We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance.

The source code and trained models for both tmChem models are available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem. The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator

Most cited references 22

PubTator: a web-based text mining tool for assisting biocuration

Chih-Hsuan Wei, Hung-Yu Kao, Zhiyong Lu (2013)

Manually curating knowledge from biomedical literature into structured databases is highly expensive and time-consuming, making it difficult to keep pace with the rapid growth of the literature. There is therefore a pressing need to assist biocuration with automated text mining tools. Here, we describe PubTator, a web-based system for assisting biocuration. PubTator is different from the few existing tools by featuring a PubMed-like interface, which many biocurators find familiar, and being equipped with multiple challenge-winning text mining algorithms to ensure the quality of its automatic results. Through a formal evaluation with two external user groups, PubTator was shown to be capable of improving both the efficiency and accuracy of manual curation. PubTator is publicly available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator/.

DNorm: disease name normalization with pairwise learning to rank

Robert Leaman, Rezarta Islamaj Doğan, Zhiyong Lu (2013)

Motivation: Despite the central role of diseases in biomedical research, there have been far fewer attempts to automatically determine which diseases are mentioned in a text—the task of disease name normalization (DNorm)—compared with other normalization tasks in biomedical text mining research. Methods: In this article we introduce the first machine learning approach for DNorm, using the NCBI disease corpus and the MEDIC vocabulary, which combines MeSH® and OMIM. Our method is a high-performing and mathematically principled framework for learning similarities between mentions and concept names directly from training data. The technique is based on pairwise learning to rank, which has not previously been applied to the normalization task but has proven successful in large optimization problems for information retrieval. Results: We compare our method with several techniques based on lexical normalization and matching, MetaMap and Lucene. Our algorithm achieves 0.782 micro-averaged F-measure and 0.809 macro-averaged F-measure, an increase over the highest performing baseline method of 0.121 and 0.098, respectively. Availability: The source code for DNorm is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/DNorm, along with a web-based demonstration and links to the NCBI disease corpus. Results on PubMed abstracts are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator Contact: zhiyong.lu@nih.gov

tmVar: a text mining approach for extracting sequence variants in biomedical literature.

Zhiyong Lu, Hung-Wen Kao, Chih-Hsuan Wei … (2013)

Text-mining mutation information from the literature has become a critical part of the bioinformatics approach for the analysis and interpretation of sequence variations in complex diseases in the post-genomic era. It has also been used for assisting the creation of disease-related mutation databases. Most existing approaches are rule-based and focus on limited types of sequence variations, such as protein point mutations. Thus, extending their extraction scope requires significant manual effort in examining new instances and developing corresponding rules. As such, new automatic approaches are greatly needed for extracting different kinds of mutations with high accuracy. Here, we report tmVar, a text-mining approach based on conditional random fields (CRF) for extracting a wide range of sequence variants described at protein, DNA and RNA levels according to a standard nomenclature developed by the Human Genome Variation Society. By doing so, we cover several important types of mutations that were not considered in past studies. Using a novel CRF label model and feature set, our method achieves higher performance than a state-of-the-art method on both our corpus (91.4 versus 78.1% in F-measure) and their own gold standard (93.9 versus 89.4% in F-measure). These results suggest that tmVar is a high-performance method for mutation extraction from biomedical literature. tmVar software and its corpus of 500 manually curated abstracts are available for download at http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/pub/tmVar

Author and article information

Contributors

Robert Leaman

Chih-Hsuan Wei

Zhiyong Lu

Journal

Journal ID (nlm-ta): J Cheminform

Journal ID (iso-abbrev): J Cheminform

Title: Journal of Cheminformatics

Publisher: BioMed Central

ISSN (Electronic): 1758-2946

Publication date Collection: 2015

Publication date (Electronic): 19 January 2015

Volume: 7

Issue: Suppl 1

Page: S3

Affiliations

[1] National Center for Biotechnology Information, 8600 Rockville Pike, Bethesda, Maryland 20894, USA

Article

Publisher ID: 1758-2946-7-S1-S3

DOI: 10.1186/1758-2946-7-S1-S3

PMC ID: 4331693

PubMed ID: 25810774

SO-VID: 8139c1b8-b392-4cf7-a376-48f1b3a4da4d

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
