Large language models identify causal genes in complex trait GWAS

Abstract

Identifying underlying causal genes at significant loci from genome-wide association studies (GWAS) remains a challenging task. Literature evidence for disease-gene co-occurrence, whether through automated approaches or human expert annotation, is one way of nominating causal genes at GWAS loci. However, current automated approaches are limited in accuracy and generalizability, and expert annotation is not scalable to hundreds of thousands of significant findings. Here, we demonstrate that large language models (LLMs) can accurately identify genes likely to be causal at loci from GWAS. By evaluating the performance of GPT-3.5 and GPT-4 on datasets of GWAS loci with high-confidence causal gene annotations, we show that these models outperform state-of-the-art methods in identifying putative causal genes. These findings highlight the potential of LLMs to augment existing approaches to causal gene discovery.
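The evaluation described in the abstract, comparing a model's nominated gene with a high-confidence annotation at each locus, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the prompt wording, the `build_prompt` and `top1_accuracy` helpers, the placeholder gene symbols, and the top-1 accuracy metric. The `predict` argument stands in for any model call; a trivial baseline that picks the first listed gene is used here so the example runs without an LLM.

```python
from typing import Callable, Dict, List


def build_prompt(phenotype: str, genes: List[str]) -> str:
    """Format a hypothetical query for one locus (wording is an assumption)."""
    return (
        f"GWAS phenotype: {phenotype}\n"
        f"Genes at the locus: {', '.join(genes)}\n"
        "Which single gene is most likely causal? Answer with the gene symbol only."
    )


def top1_accuracy(loci: List[Dict], predict: Callable[[str], str]) -> float:
    """Fraction of loci where the predicted gene matches the annotated one."""
    if not loci:
        return 0.0
    correct = 0
    for locus in loci:
        answer = predict(build_prompt(locus["phenotype"], locus["genes"]))
        # Compare gene symbols case-insensitively, ignoring stray whitespace.
        if answer.strip().upper() == locus["causal_gene"].upper():
            correct += 1
    return correct / len(loci)


def first_gene_baseline(prompt: str) -> str:
    """Stand-in for a model: returns the first gene listed in the prompt."""
    for line in prompt.splitlines():
        if line.startswith("Genes at the locus: "):
            return line[len("Genes at the locus: "):].split(", ")[0]
    return ""


# Made-up placeholder loci; the symbols are not real annotations.
TOY_LOCI = [
    {"phenotype": "trait 1", "genes": ["GENE_A", "GENE_B"], "causal_gene": "GENE_A"},
    {"phenotype": "trait 2", "genes": ["GENE_C", "GENE_D"], "causal_gene": "GENE_D"},
]

if __name__ == "__main__":
    print(top1_accuracy(TOY_LOCI, first_gene_baseline))
```

Replacing `first_gene_baseline` with a function that sends the prompt to a language model and returns its answer would score that model in the same way, under the same assumptions.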

Author and article information

Journal

Publisher: medRxiv

Publication date (Electronic preprint): May 31 2024

Article

DOI: 10.1101/2024.05.30.24308179

SO-VID: 14e69630-41e5-4421-b756-37c5ea33c01d
