SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants.

Abstract

Accurate variant pathogenicity predictions are important in genetic studies of human diseases. Inframe insertion and deletion variants (indels) alter protein sequence and length, but are not as deleterious as frameshift indels. Interpretation of inframe indels is challenging because few known pathogenic variants are available for training. Existing prediction methods largely use manually encoded features, including conservation, protein structure and function, and allele frequency, to infer variant pathogenicity. Recent advances in deep learning modeling of protein sequences and structures provide an opportunity to improve the representation of salient features based on large numbers of protein sequences. We developed a new pathogenicity predictor for SHort Inframe iNsertion and dEletion (SHINE). SHINE uses pretrained protein language models to construct a latent representation of an indel and its protein context from protein sequences and multiple protein sequence alignments, and feeds the latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and that new methods based on these models can improve variant interpretation in genetic analyses.
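As a rough illustration of the idea of building a fixed-length representation of an indel and its protein context from per-residue embeddings, the sketch below may help. It is not SHINE's implementation: `toy_embed` is a hypothetical stand-in (a one-hot encoding) for a pretrained protein language model, and the mean pooling, the `flank` window size, and all function names are assumptions made for this example only.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq: str) -> List[List[float]]:
    """Hypothetical stand-in for a pretrained protein language model.

    Returns one vector per residue (a one-hot encoding here). A real model
    would return learned, context-dependent embeddings instead.
    Raises ValueError for characters outside the 20 standard amino acids.
    """
    vectors = []
    for aa in seq:
        vec = [0.0] * len(AMINO_ACIDS)
        vec[AMINO_ACIDS.index(aa)] = 1.0
        vectors.append(vec)
    return vectors


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Average per-residue vectors into a single fixed-length vector."""
    dim = len(AMINO_ACIDS)
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def apply_indel(ref_seq: str, start: int, deleted_len: int, inserted: str = "") -> str:
    """Replace ref_seq[start:start + deleted_len] with `inserted` (0-based)."""
    return ref_seq[:start] + inserted + ref_seq[start + deleted_len:]


def indel_features(ref_seq: str, start: int, deleted_len: int,
                   inserted: str = "", flank: int = 3) -> List[float]:
    """Concatenate pooled embeddings of the reference and variant windows.

    The window covers the indel plus `flank` residues on each side
    (an assumed choice, not taken from the article).
    """
    alt_seq = apply_indel(ref_seq, start, deleted_len, inserted)
    lo = max(0, start - flank)
    ref_hi = min(len(ref_seq), start + deleted_len + flank)
    alt_hi = min(len(alt_seq), start + len(inserted) + flank)
    ref_vec = mean_pool(toy_embed(ref_seq)[lo:ref_hi])
    alt_vec = mean_pool(toy_embed(alt_seq)[lo:alt_hi])
    return ref_vec + alt_vec  # list concatenation: 2 * 20 values


# Example: a two-residue inframe deletion in a made-up peptide.
features = indel_features("MKTAYIAKQR", start=4, deleted_len=2)
```

In a pipeline of the kind the abstract describes, a vector like `features` would be passed to a supervised classifier trained on labeled pathogenic and benign indels; that step is omitted here because it depends on training data not shown on this page.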

Author and article information

Journal

Journal ID (iso-abbrev): Brief Bioinform

Title: Briefings in bioinformatics

Publisher: Oxford University Press (OUP)

ISSN (Electronic): 1477-4054

ISSN (Print): 1467-5463

Publication date (Electronic): Jan 19 2023

Volume: 24

Issue: 1

Affiliations

[1] Department of Pediatrics, Columbia University, New York, NY, USA.

[2] Department of Systems Biology, Columbia University, New York, NY, USA.

[3] Department of Biomedical Informatics, Columbia University, New York, NY, USA.

[4] Lynbrook High School, San Jose, CA, USA.

[5] Department of Medicine, Columbia University, New York, NY, USA.

[6] JP Sulzberger Columbia Genome Center, Columbia University, New York, NY, USA.

Article

Publisher Item ID: 6961792

DOI: 10.1093/bib/bbac584

PMC ID: 9851320

PubMed ID: 36575831

SO-VID: 8a27fda9-d551-400a-82c3-ea9e85311f79

Keywords: protein language model, variant pathogenicity, transformer, transfer learning, inframe indel
