      Is Open Access

      Genetic Discovery Enabled by A Large Language Model

      Preprint
      research-article


          Abstract

          Artificial intelligence (AI) has been used in many areas of medicine, and recently large language models (LLMs) have shown potential utility for clinical applications. However, since we do not know if the use of LLMs can accelerate the pace of genetic discovery, we used data generated from mouse genetic models to investigate this possibility. We examined whether a recently developed specialized LLM (Med-PaLM 2) could analyze sets of candidate genes generated from analysis of murine models of biomedical traits. In response to free-text input, Med-PaLM 2 correctly identified the murine genes that contained experimentally verified causative genetic factors for six biomedical traits, which included susceptibility to diabetes and cataracts. Med-PaLM 2 was also able to analyze a list of genes with high impact alleles, which were identified by comparative analysis of murine genomic sequence data, and it identified a causative murine genetic factor for spontaneous hearing loss. Based upon this Med-PaLM 2 finding, a novel bigenic model for susceptibility to spontaneous hearing loss was developed. These results demonstrate Med-PaLM 2 can analyze gene-phenotype relationships and generate novel hypotheses, which can facilitate genetic discovery.


                Author and article information

                Journal
                bioRxiv
                Cold Spring Harbor Laboratory
                12 November 2023
                2023.11.09.566468
                Affiliations
                [1] Google Research, Mountain View, CA, USA
                [2] Department of Anesthesiology, Pain and Perioperative Medicine
                [3] Department of Otolaryngology – Head and Neck Surgery, Stanford University School of Medicine, Stanford, CA 94305, USA
                Author notes
                Corresponding authors: gpeltz@stanford.edu, taotu@google.com
                [*] Equal contributions.

                Author contributions

                GP and KS formulated the project. TT and GP wrote the paper with input from all authors. TT, FZ, SS, and ZC generated experimental data. TT, FZ, and GP analyzed the data. TT, AP, and VN developed the LLM and the techniques for enabling its genetic discovery applications. All authors have read and approved the manuscript.

                Author information
                http://orcid.org/0000-0003-0233-279X
                http://orcid.org/0000-0001-6191-7697
                Article
                10.1101/2023.11.09.566468
                10659415
                37986848

                This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
