ProteinGym: Large-Scale Benchmarks for Protein Design and Fitness Prediction

Abstract

Predicting the effects of mutations in proteins is critical to many applications, from understanding genetic disease to designing novel proteins that can address our most pressing challenges in climate, agriculture and healthcare. Despite a surge in machine learning-based protein models to tackle these questions, an assessment of their respective benefits is challenging due to the use of distinct, often contrived, experimental datasets, and the variable performance of models across different protein families. Addressing these challenges requires scale. To that end, we introduce ProteinGym, a large-scale and holistic set of benchmarks specifically designed for protein fitness prediction and design. It encompasses both a broad collection of over 250 standardized deep mutational scanning assays, spanning millions of mutated sequences, and curated clinical datasets providing high-quality expert annotations about mutation effects. We devise a robust evaluation framework that combines metrics for both fitness prediction and design, factors in known limitations of the underlying experimental methods, and covers both zero-shot and supervised settings. We report the performance of a diverse set of over 70 high-performing models from various subfields (e.g., alignment-based, inverse folding) within a unified benchmark suite. We open-source the corresponding codebase, datasets, MSAs, structures, and model predictions, and develop a user-friendly website that facilitates data access and analysis.

Author and article information

Contributors

Pascal Notin:

ORCID: http://orcid.org/0000-0002-1877-8983

Aaron W. Kollasch:

ORCID: http://orcid.org/0000-0001-9733-8822

Daniel Ritter:

ORCID: http://orcid.org/0009-0009-3266-9917

Lood van Niekerk:

ORCID: http://orcid.org/0000-0001-9082-2574

Steffanie Paul:

ORCID: http://orcid.org/0000-0001-7306-4863

Hansen Spinner

Nathan Rollins:

ORCID: http://orcid.org/0000-0002-8037-6045

Ada Shaw:

ORCID: http://orcid.org/0000-0002-5283-9559

Ruben Weitzman:

ORCID: http://orcid.org/0000-0001-5882-2005

Jonathan Frazer:

ORCID: http://orcid.org/0000-0001-6900-6484

Mafalda Dias:

ORCID: http://orcid.org/0000-0002-1804-8542

Dinko Franceschi

Rose Orenbuch:

ORCID: http://orcid.org/0000-0002-4678-0837

Yarin Gal:

ORCID: http://orcid.org/0000-0002-2733-2078

Debora S. Marks:

ORCID: http://orcid.org/0000-0001-9388-2281

Journal

Journal ID (nlm-ta): bioRxiv

Journal ID (publisher-id): BIORXIV

Title: bioRxiv

Publisher: Cold Spring Harbor Laboratory

Publication date (Electronic): 08 December 2023

Electronic Location Identifier: 2023.12.07.570727

Affiliations

Computer Science, University of Oxford

Systems Biology, Harvard Medical School

Seismic Therapeutic

Applied Mathematics, Harvard University

Computer Science, University of Oxford

Centre for Genomic Regulation, Universitat Pompeu Fabra

Systems Biology, Harvard Medical School

Computer Science, University of Oxford

Harvard Medical School, Broad Institute

Author notes

[†] Equal contribution

[*] Correspondence: pascal.notin@cs.ox.ac.uk, debbie@hms.harvard.edu

Article

DOI: 10.1101/2023.12.07.570727

PMC ID: 10723403

PubMed ID: 38106144

SO-VID: 4a5f9b73-f7d7-4849-9a65-1ce8b58a9ded

License:

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
