
      Genome-wide prediction of disease variant effects with a deep protein language model

      research-article


          Abstract

          Predicting the effects of coding variants is a major challenge. While recent deep-learning models have improved variant effect prediction accuracy, they cannot analyze all coding variants due to dependency on close homologs or software limitations. Here we developed a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal. ESM1b outperformed existing methods in classifying ~150,000 ClinVar/HGMD missense variants as pathogenic or benign and predicting measurements across 28 deep mutational scan datasets. We further annotated ~2 million variants as damaging only in specific protein isoforms, demonstrating the importance of considering all isoforms when predicting variant effects. Our approach also generalizes to more complex coding variants such as in-frame indels and stop-gains. Together, these results establish protein language models as an effective, accurate and general approach to predicting variant effects.
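          The abstract does not spell out the scoring rule. One common way to score a missense variant with a protein language model is a log-likelihood ratio (LLR): the model's log-probability of the mutant residue minus that of the wild-type residue at the same position. The sketch below assumes that rule and uses made-up logits in place of real ESM1b output; the names `missense_llr_scores` and `toy_logits` are hypothetical and are not part of any published API.

```python
import math

# The 20 standard amino acids, used as the (assumed) output vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Convert raw logits into log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def missense_llr_scores(sequence, logits_per_position):
    """Score every single-residue substitution of `sequence`.

    `logits_per_position[i]` holds one logit per entry of AMINO_ACIDS for
    residue i. In a real workflow these would come from a protein language
    model; here they are supplied by the caller (an assumption of this sketch).
    Returns a dict such as {"M1A": -2.0, ...}; more negative values mean the
    model finds the mutant residue less likely than the wild type.
    """
    scores = {}
    for pos, (wt, logits) in enumerate(zip(sequence, logits_per_position), start=1):
        log_probs = dict(zip(AMINO_ACIDS, log_softmax(logits)))
        for alt in AMINO_ACIDS:
            if alt == wt:
                continue  # a substitution must change the residue
            scores[f"{wt}{pos}{alt}"] = log_probs[alt] - log_probs[wt]
    return scores


def toy_logits(favoured, bonus):
    """Made-up logits that favour one residue by `bonus`; not model output."""
    return [bonus if aa == favoured else 0.0 for aa in AMINO_ACIDS]


sequence = "MK"
logits = [toy_logits("M", 2.0), toy_logits("K", 1.0)]
scores = missense_llr_scores(sequence, logits)
print(len(scores))  # 38 variants: 19 substitutions at each of 2 positions
```

          Scoring all 19 substitutions at every position from one set of per-position log-probabilities is what makes exhaustive coverage of a protein cheap once those log-probabilities are available.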

          Abstract

          A modified framework leveraging a protein language model (ESM1b) is used to predict all possible 450 million missense variant effects in the human genome and shows potential for generalizing to more complex genetic variations such as indels and stop-gains.


                Author and article information

                Contributors
                jimmie.ye@ucsf.edu
                vasilis.ntranos@ucsf.edu
                Journal
                Nature Genetics (Nat Genet)
                Nature Publishing Group US (New York)
                ISSN: 1061-4036 (print), 1546-1718 (electronic)
                Published: 10 August 2023
                Volume 55, issue 9, pages 1512-1522
                Affiliations
                [1] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Division of Rheumatology, Department of Medicine, University of California, San Francisco; San Francisco, CA, USA
                [2] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Biological and Medical Informatics Graduate Program, University of California, San Francisco; San Francisco, CA, USA
                [3] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Biomedical Sciences Graduate Program, University of California, San Francisco; San Francisco, CA, USA
                [4] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Bakar Computational Health Sciences Institute, University of California, San Francisco; San Francisco, CA, USA
                [5] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Parker Institute for Cancer Immunotherapy, University of California, San Francisco; San Francisco, CA, USA
                [6] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Gladstone-UCSF Institute of Genomic Immunology; San Francisco, CA, USA
                [7] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Institute for Human Genetics, University of California, San Francisco; San Francisco, CA, USA
                [8] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Department of Epidemiology & Biostatistics, University of California, San Francisco; San Francisco, CA, USA
                [9] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco; San Francisco, CA, USA
                [10] GRID grid.266102.1, ISNI 0000 0001 2297 6811, Diabetes Center, University of California, San Francisco; San Francisco, CA, USA
                Author information
                http://orcid.org/0000-0002-0510-2546
                http://orcid.org/0000-0001-7588-2077
                http://orcid.org/0000-0001-6560-3783
                http://orcid.org/0000-0002-2477-0670
                Article
                1465
                DOI: 10.1038/s41588-023-01465-0
                PMCID: PMC10484790
                PMID: 37563329
                57d6470b-2076-4b9e-8afb-37a40822510c
                © The Author(s) 2023

                Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

                History
                Received: 8 August 2022
                Accepted: 5 July 2023
                Categories
                Article
                Custom metadata
                © Springer Nature America, Inc. 2023

                Genetics
                functional genomics, bioinformatics
