
      BioCoder: a benchmark for bioinformatics code generation with large language models

      research-article


          Abstract

          Summary

          Pretrained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate LLMs in generating bioinformatics-specific code. BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate various models including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned one model (StarCoder), demonstrating that our training dataset can enhance the performance on our testing benchmark (by >15% in terms of Pass@K under certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (i) Successful models accommodate a long prompt (>2600 tokens) with full context, including functional dependencies. (ii) They contain domain-specific knowledge of bioinformatics, beyond just general coding capability. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on our benchmark (50% versus up to 25%).
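The evaluation loop described in the summary (fuzz-test each generated function, then report Pass@K) can be sketched as follows. This is a minimal illustration, not BioCoder's actual framework: the GC-content task and the names `gc_content_reference`, `fuzz_passes`, and the three candidate functions are all hypothetical. The `pass_at_k` function uses the unbiased estimator that is widely used for code-generation benchmarks; that BioCoder computes Pass@K in exactly this way is an assumption.

```python
import random
from math import comb
from typing import Callable


def gc_content_reference(seq: str) -> float:
    """Hypothetical reference ("golden") function: fraction of G/C bases."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def fuzz_passes(candidate: Callable[[str], float],
                reference: Callable[[str], float],
                trials: int = 200, seed: int = 0) -> bool:
    """Return True if candidate matches reference on an edge case plus random inputs."""
    rng = random.Random(seed)  # fixed seed keeps the fuzz run reproducible
    inputs = [""]  # always include the empty sequence as an edge case
    for _ in range(trials):
        length = rng.randint(1, 50)
        inputs.append("".join(rng.choice("ACGT") for _ in range(length)))
    for seq in inputs:
        expected = reference(seq)
        try:
            if abs(candidate(seq) - expected) > 1e-9:
                return False
        except Exception:
            return False  # a crash or a non-numeric result counts as a failure
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n samples of which c passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Three hypothetical model outputs for the same prompt.
def candidate_correct(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def candidate_counts_only_g(seq: str) -> float:
    return seq.count("G") / len(seq) if seq else 0.0  # bug: ignores C


def candidate_crashes_on_empty(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)  # ZeroDivisionError on ""


CANDIDATES = [candidate_correct, candidate_counts_only_g, candidate_crashes_on_empty]
```

With these three candidates only the first passes, so `pass_at_k(3, 1, 1)` is 1/3 and `pass_at_k(3, 1, 2)` is 2/3. A benchmark-level score would average this value over all problems.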

          Availability and implementation

          All datasets, benchmark, Docker images, and scripts required for testing are available at: https://github.com/gersteinlab/biocoder and https://biocoder-benchmark.github.io/.


                Author and article information

                Contributors
                Journal
                Bioinformatics
                Oxford University Press
                ISSN: 1367-4803, 1367-4811
                July 2024
                28 June 2024
                Volume: 40
                Issue: Suppl 1, ISMB 2024 Proceedings
                Pages: i266-i276
                Affiliations
                Department of Computer Science, Yale University, New Haven, CT 06520, United States
                Google Deepmind, Mountain View, CA 94043, United States
                Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
                Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
                Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
                Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
                Author notes
                Corresponding author. Department of Computer Science, Yale University, New Haven, CT 06520, United States. E-mail: mark@gersteinlab.org (M.B.G.)
                Author information
                https://orcid.org/0000-0002-9746-3719
                Article
                Article ID: btae230
                DOI: 10.1093/bioinformatics/btae230
                PMCID: PMC11211839
                PMID: 38940140
                © The Author(s) 2024. Published by Oxford University Press.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                Page count
                Pages: 12
                Funding
                Funded by: Schmidt Futures, DOI 10.13039/100027426;
                Categories
                General Computational Biology
                AcademicSubjects/SCI01060

                Bioinformatics & Computational biology
