BepiPred‐3.0: Improved B‐cell epitope prediction using protein language models

Abstract

B‐cell epitope prediction tools are of great medical and commercial interest due to their practical applications in vaccine development and disease diagnostics. Protein language models (LMs), trained on unprecedentedly large datasets of protein sequences and structures, provide a powerful numeric representation that can be exploited to accurately predict local and global protein structural features from amino acid sequences alone. In this paper, we present BepiPred‐3.0, a sequence‐based epitope prediction tool that, by exploiting LM embeddings, greatly improves prediction accuracy for both linear and conformational epitopes on several independent test sets. Careful selection of additional input variables and of the epitope residue annotation strategy improved performance further, achieving unprecedented predictive power. The tool can predict epitopes across hundreds of sequences in minutes. It is freely available as a web server and a standalone package at https://services.healthtech.dtu.dk/service.php?BepiPred-3.0, with a user‐friendly interface for navigating the results.

Author and article information

Contributors

Joakim Nøddeskov Clifford:

ORCID: https://orcid.org/0000-0002-8126-9209

cliffordjoakim@gmail.com

Journal

Journal ID (nlm-ta): Protein Sci

Journal ID (iso-abbrev): Protein Sci

Journal ID (doi): 10.1002/(ISSN)1469-896X

Journal ID (publisher-id): PRO

Title: Protein Science : A Publication of the Protein Society

Publisher: John Wiley & Sons, Inc. (Hoboken, USA)

ISSN (Print): 0961-8368

ISSN (Electronic): 1469-896X

Publication date (Print): December 2022

Publication date PMC-release: December 2022

Volume: 31

Issue: 12 (doiID: 10.1002/pro.v31.12)

Electronic Location Identifier: e4497

Affiliations

[¹] Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark

[²] La Jolla Institute for Immunology, La Jolla, California, USA

Author notes

[*] Correspondence

Joakim Nøddeskov Clifford, Department of Health Technology, Technical University of Denmark, Kongens Lyngby 2800, Denmark.

Email: cliffordjoakim@gmail.com

Article

Publisher ID: PRO4497

DOI: 10.1002/pro.4497

PMC ID: 9679979

PubMed ID: 36366745

SO-VID: 0fb9c5be-2a45-4276-9c5c-7a339eeb1238

License:

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (http://creativecommons.org/licenses/by-nc/4.0/), which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.

History

Date received: 28 July 2022

Date revision received: 31 October 2022

Date accepted: 01 November 2022

Page count

Figures: 3, Tables: 6, Pages: 11, Words: 6726

ScienceOpen disciplines: Biochemistry

Keywords: bepipred‐3.0, bepipred, b‐cell epitope prediction, protein language model, machine learning, deep learning, immunology, b‐cell epitopes, bioinformatics, immunoinformatics
