Structured information extraction from scientific text with large language models

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
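As a rough illustration of the "list of JSON objects" output format mentioned above, the sketch below parses a model completion into dopant-host relation pairs. The completion string, the `host` and `dopants` keys, and the helper names `parse_records` and `dopant_host_pairs` are all assumptions made for this example; they are not the schema or code used in the article.

```python
import json

# Hypothetical completion text. The schema ("host", "dopants") is an
# assumption for illustration, not the schema used in the article.
completion = """
[
  {"host": "ZnO", "dopants": ["Al", "Ga"]},
  {"host": "TiO2", "dopants": ["Nb"]}
]
"""


def parse_records(text):
    """Parse completion text into a list of dict records, or raise ValueError."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"completion is not valid JSON: {exc}") from exc
    if not isinstance(data, list) or not all(isinstance(r, dict) for r in data):
        raise ValueError("expected a JSON list of objects")
    return data


def dopant_host_pairs(records):
    """Flatten records into (dopant, host) relation pairs."""
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((dopant, host))
    return pairs


records = parse_records(completion)
pairs = dopant_host_pairs(records)
print(pairs)
```

Validating the parsed structure before use matters here because a language model's completion is free text and may not always be well-formed JSON; rejecting malformed output early keeps bad records out of the resulting database.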

Abstract

Extracting scientific data from published research is a complex task requiring specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.

Author and article information

Contributors

Anubhav Jain:

ORCID: http://orcid.org/0000-0001-5893-9967

ajain@lbl.gov

Journal

Journal ID (nlm-ta): Nat Commun

Journal ID (iso-abbrev): Nat Commun

Title: Nature Communications

Publisher: Nature Publishing Group UK (London)

ISSN (Electronic): 2041-1723

Publication date (Electronic): 15 February 2024

Publication date PMC-release: 15 February 2024

Publication date Collection: 2024

Volume: 15

Electronic Location Identifier: 1418

Affiliations

[1] Lawrence Berkeley National Laboratory (https://ror.org/02jbv0t02), Berkeley, CA, USA

[2] Materials Science and Engineering Department, University of California, Berkeley, CA, USA (GRID grid.47840.3f, ISNI 0000 0001 2181 7878)

Author information

John Dagdelen http://orcid.org/0000-0003-2181-4815

Alexander Dunn http://orcid.org/0000-0002-8567-1879

Andrew S. Rosen http://orcid.org/0000-0002-0141-7006

Kristin A. Persson http://orcid.org/0000-0003-2495-5509

Anubhav Jain http://orcid.org/0000-0001-5893-9967

Article

Publisher ID: 45563

DOI: 10.1038/s41467-024-45563-x

PMC ID: 10869356

PubMed ID: 38360817

SO-VID: 8ded5ab0-4433-47ba-8b92-cdb91c8514f9

License:

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

History

Date received: 17 March 2023

Date accepted: 22 January 2024

Custom metadata

ScienceOpen disciplines: Uncategorized

Keywords: materials science, theory and computation, scientific data, databases
