
      SHINE: protein language model-based pathogenicity prediction for short inframe insertion and deletion variants.


          Abstract

          Accurate variant pathogenicity predictions are important in genetic studies of human diseases. Inframe insertion and deletion variants (indels) alter protein sequence and length, but are generally not as deleterious as frameshift indels. Interpreting inframe indels is challenging because few known pathogenic variants are available for training. Existing prediction methods largely use manually encoded features, including conservation, protein structure and function, and allele frequency, to infer variant pathogenicity. Recent advances in deep learning modeling of protein sequences and structures provide an opportunity to improve the representation of salient features based on large numbers of protein sequences. We developed a new pathogenicity predictor for SHort Inframe iNsertion and dEletion variants (SHINE). SHINE uses pretrained protein language models to construct a latent representation of an indel and its protein context from protein sequences and multiple protein sequence alignments, and feeds the latent representation into supervised machine learning models for pathogenicity prediction. We curated training data from ClinVar and gnomAD, and created two test datasets from different sources. SHINE achieved better prediction performance than existing methods for both deletion and insertion variants in these two test datasets. Our work suggests that unsupervised protein language models can provide valuable information about proteins, and that new methods based on these models can improve variant interpretation in genetic analyses.
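
The two-stage idea in the abstract (a latent representation of an indel and its protein context, fed into a supervised model) can be illustrated with a minimal sketch. This is not the SHINE implementation. Everything below is an assumption made for illustration: `toy_embed` is a hypothetical stand-in that returns amino-acid composition instead of calling a pretrained protein language model, the feature layout (reference window plus alternate-minus-reference difference) is one plausible choice, the classifier is a plain logistic regression, and the sequence and the two labels are made up rather than drawn from ClinVar or gnomAD.

```python
import math
from typing import List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(window: str) -> List[float]:
    # ASSUMPTION: placeholder for a pretrained protein language model.
    # It returns the 20-dim amino-acid composition of the window.
    n = max(len(window), 1)
    return [window.count(aa) / n for aa in AMINO_ACIDS]


def indel_features(seq: str, start: int, end: int,
                   inserted: str = "", flank: int = 10) -> List[float]:
    # seq[start:end] is deleted and `inserted` is placed at `start`
    # (0-based, end exclusive). A pure insertion has start == end.
    left = seq[max(0, start - flank):start]
    right = seq[end:end + flank]
    ref_vec = toy_embed(left + seq[start:end] + right)
    alt_vec = toy_embed(left + inserted + right)
    # Hypothetical layout: reference context plus the change the indel causes.
    return ref_vec + [a - r for a, r in zip(alt_vec, ref_vec)]


def sigmoid(z: float) -> float:
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def score(w: List[float], b: float, x: List[float]) -> float:
    # Predicted probability of the positive ("pathogenic") class.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def train_logistic(X: List[List[float]], y: List[int],
                   lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    # Plain stochastic gradient descent on the logistic loss.
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = score(w, b, xi) - yi
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


# Made-up sequence and arbitrary labels, for demonstration only.
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
x_del = indel_features(protein, 5, 7)                   # 2-residue deletion
x_ins = indel_features(protein, 12, 12, inserted="GG")  # 2-residue insertion
w, b = train_logistic([x_del, x_ins], [1, 0])
```

In a realistic setting the placeholder embedding would be replaced by vectors from a pretrained model, and the classifier would be trained and evaluated on curated variant sets rather than two toy examples.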


          Author and article information

          Journal
          Brief Bioinform
          Briefings in bioinformatics
          Oxford University Press (OUP)
          1477-4054
          1467-5463
          Jan 19 2023
          Volume: 24
          Issue: 1
          Affiliations
          [1] Department of Pediatrics, Columbia University, New York, NY, USA.
          [2] Department of Systems Biology, Columbia University, New York, NY, USA.
          [3] Department of Biomedical Informatics, Columbia University, New York, NY, USA.
          [4] Lynbrook High School, San Jose, CA, USA.
          [5] Department of Medicine, Columbia University, New York, NY, USA.
          [6] JP Sulzberger Columbia Genome Center, Columbia University, New York, NY, USA.
          Article
          6961792
          DOI: 10.1093/bib/bbac584
          PMC ID: 9851320
          PubMed ID: 36575831
          8a27fda9-d551-400a-82c3-ea9e85311f79

          Keywords: protein language model, variant pathogenicity, transformer, transfer learning, inframe indel
