BioCoder: a benchmark for bioinformatics code generation with large language models

Abstract

Summary

Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).
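As an illustration of the two evaluation ideas named above, the following is a minimal sketch rather than the BioCoder implementation. It assumes Pass@K is computed with the widely used unbiased estimator 1 - C(n - c, k) / C(n, k), where n is the number of generated samples and c the number that pass, and it assumes fuzz testing means comparing a generated function against a reference on randomly generated inputs. All function names (`pass_at_k`, `fuzz_equivalent`, `gc_content_reference`, `random_dna`) are hypothetical and do not come from the benchmark.

```python
import math
import random


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate, 1 - C(n - c, k) / C(n, k).

    n: samples generated for one problem, c: samples that passed,
    k: budget of samples considered. Whether BioCoder uses exactly
    this estimator is an assumption of this sketch.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every k-subset contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def fuzz_equivalent(candidate, reference, make_input, trials=100, seed=0):
    """Return True if candidate matches reference on `trials` random inputs.

    make_input(rng) is a hypothetical generator that builds one input
    from a seeded random.Random instance. Any exception raised by the
    candidate counts as a failure.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = make_input(rng)
        expected = reference(x)
        try:
            if candidate(x) != expected:
                return False
        except Exception:
            return False
    return True


def gc_content_reference(seq: str) -> float:
    """Reference solution: fraction of G and C bases in a DNA string."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def random_dna(rng: random.Random) -> str:
    """Random DNA string of length 0 to 50."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, 50)))


if __name__ == "__main__":
    # Stand-ins for model generations: one correct, one that ignores C.
    generations = [
        lambda s: sum(b in "GC" for b in s) / len(s) if s else 0.0,
        lambda s: s.count("G") / len(s) if s else 0.0,
    ]
    passed = sum(
        fuzz_equivalent(g, gc_content_reference, random_dna) for g in generations
    )
    print("passing samples:", passed, "of", len(generations))
    print("pass@1 estimate:", pass_at_k(len(generations), passed, 1))
```

Seeding the input generator keeps a fuzz run reproducible, so the same generated function receives the same verdict on every evaluation.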

Availability and implementation

All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.

Related collections

Databases and Data Resources for Drug Repurposing (REPO4EU)

Author and article information

Contributors

Xiangru Tang

Bill Qian

Rick Gao

Jiakang Chen

Xinyun Chen

Mark B Gerstein

ORCID: https://orcid.org/0000-0002-9746-3719

Journal

Journal ID (nlm-ta): Bioinformatics

Journal ID (iso-abbrev): Bioinformatics

Journal ID (publisher-id): bioinformatics

Title: Bioinformatics

Publisher: Oxford University Press

ISSN (Print): 1367-4803

ISSN (Electronic): 1367-4811

Publication date Collection: July 2024

Publication date (Electronic): 28 June 2024

Publication date PMC-release: 28 June 2024

Volume: 40

Issue: Suppl 1, ISMB 2024 Proceedings

Pages: i266-i276

Affiliations

Department of Computer Science, Yale University, New Haven, CT 06520, United States

Google DeepMind, Mountain View, CA 94043, United States

Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States

Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States

Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States

Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States

Author notes

Corresponding author. Department of Computer Science, Yale University, New Haven, CT 06520, United States. E-mail: mark@gersteinlab.org (M.B.G.)

Article

Publisher ID: btae230

DOI: 10.1093/bioinformatics/btae230

PMC ID: 11211839

PubMed ID: 38940140

SO-VID: d833a01d-c476-4ab9-bbbe-ca2724f3c8fa

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Page count

Pages: 12

Funding

Funded by: Schmidt Futures, DOI 10.13039/100027426
