ProMiner: rule-based protein and gene entity recognition

Abstract

Background

Identification of gene and protein names in biomedical text is a challenging task as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data.

Methods

The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in the biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach and its search algorithm is geared towards recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms for one abstract, the most plausible database identifiers are associated with the text. Organism specificity is addressed by a simple procedure based on additionally detected organism names in an abstract.
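The general idea of dictionary-based matching with a preference for multi-word names can be illustrated with a minimal sketch. This is not ProMiner's actual algorithm: the toy dictionary, the identifiers (`GENE:1` and so on), the tokenizer, and the function names `build_index` and `find_matches` are all hypothetical, and the sketch omits the disambiguation and organism-filtering steps described above. It only shows a greedy longest-match lookup in which selected synonyms must match with exact case.

```python
import re
from typing import NamedTuple


class Match(NamedTuple):
    start: int       # index of the first matched token
    end: int         # index one past the last matched token
    text: str        # matched surface form
    ids: frozenset   # candidate database identifiers


# Hypothetical toy dictionary: synonym -> (identifiers, case_sensitive)
SYNONYMS = {
    "tumor necrosis factor alpha": ({"GENE:1"}, False),
    "tumor necrosis factor": ({"GENE:1", "GENE:2"}, False),
    "TNF": ({"GENE:1"}, True),
    "WAS": ({"GENE:3"}, True),   # collides with a common English word
}


def tokenize(text):
    """Split text into alphanumeric tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(synonyms):
    """Index synonyms by their lowercased token sequence."""
    index = {}
    for name, (ids, case_sensitive) in synonyms.items():
        tokens = tuple(tokenize(name))
        key = tuple(t.lower() for t in tokens)
        index.setdefault(key, []).append((tokens, frozenset(ids), case_sensitive))
    return index


def find_matches(text, index):
    """Greedy left-to-right scan that prefers the longest synonym at each position."""
    tokens = tokenize(text)
    max_len = max(len(key) for key in index)
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest window first so multi-word names win over their prefixes.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tokens[i:i + n])
            key = tuple(t.lower() for t in window)
            for exact, ids, case_sensitive in index.get(key, []):
                if case_sensitive and window != exact:
                    continue  # case-sensitive synonyms need an exact-case hit
                found = Match(i, i + n, " ".join(window), ids)
                break
            if found:
                break
        if found:
            matches.append(found)
            i = found.end
        else:
            i += 1
    return matches


if __name__ == "__main__":
    sentence = "Expression of tumor necrosis factor alpha was increased; TNF levels rose."
    for m in find_matches(sentence, build_index(SYNONYMS)):
        print(m.text, sorted(m.ids))
```

In this sketch the four-word name is matched in preference to its three-word prefix, and the lowercase verb "was" is not reported as a hit for the case-sensitive synonym "WAS".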

Results

The extended ProMiner system has been applied to the test cases of the BioCreAtIvE competition with highly encouraging results. In blind predictions, the system achieved an F-measure of approximately 0.8 for the organisms mouse and fly and about 0.9 for the organism yeast.
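For readers unfamiliar with the metric, the following sketch shows how an F-measure can be computed for a predicted list of gene identifiers against a curated list. It assumes the balanced form (the harmonic mean of precision and recall); the function names and the example identifiers are hypothetical and are not taken from the BioCreAtIvE evaluation scripts.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Compare a predicted identifier set with a gold-standard set.

    Returns (precision, recall, f_measure).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


if __name__ == "__main__":
    # Hypothetical identifiers for one abstract
    predicted = {"G1", "G2", "G3", "G4"}
    gold = {"G1", "G2", "G3", "G5", "G6"}
    p, r, f = evaluate(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system cannot reach a high F-measure by maximizing precision or recall alone.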

Author and article information

Conference

Journal ID (nlm-ta): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central (London)

ISSN (Electronic): 1471-2105

Publication date (Collection): 2005

Publication date (Electronic): 24 May 2005

Volume: 6

Issue: Suppl 1

Page: S14

Affiliations

[1] Fraunhofer Institute SCAI, Schloss Birlinghoven, 53754 Sankt Augustin, Germany

[2] Current address: Aventis Pharma Deutschland, Industriepark Hoechst G879, 65926 Frankfurt am Main, Germany

[3] Institute for Informatics, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany

Article

Publisher ID: 1471-2105-6-S1-S14

DOI: 10.1186/1471-2105-6-S1-S14

PMC ID: 1869006

PubMed ID: 15960826

SO-VID: a67e6eee-cf61-4dfd-8df0-45c300c4f7e1

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Conference name: A critical assessment of text mining methods in molecular biology

Conference location: Granada, Spain

Conference date: March 28–31, 2004
