      Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences

      research-article
      PLoS Computational Biology
      Public Library of Science


          Abstract

          Machine learning with multi-layered artificial neural networks, also known as “deep learning,” is effective for making biological predictions. However, model interpretation is challenging, especially for sequential input data used with recurrent neural network architectures. Here, we introduce a framework called “Positional SHAP” (PoSHAP) to interpret models trained from biological sequences by utilizing SHapley Additive exPlanations (SHAP) to generate positional model interpretations. We demonstrate this using three long short-term memory (LSTM) regression models that predict peptide properties, including binding affinity to major histocompatibility complexes (MHC), and collisional cross section (CCS) measured by ion mobility spectrometry. Interpretation of these models with PoSHAP reproduced MHC class I (rhesus macaque Mamu-A1*001 and human A*11:01) peptide binding motifs, reflected known properties of peptide CCS, and provided new insights into interpositional dependencies of amino acid interactions. PoSHAP should have widespread utility for interpreting a variety of models trained from biological sequences.

          Author summary

          Machine learning enables biochemical predictions. However, the relationships learned by many algorithms are not directly interpretable. Model interpretation methods are important because they enable human comprehension of learned relationships. Methods like SHapley Additive exPlanations (SHAP) were developed to determine how each input alters the model prediction. However, interpretation of models trained from biological sequences remains more challenging; model interpretation often ignores ordering of inputs. Here, we train machine learning models using biological sequence data as an input to predict peptide collisional cross section, and to predict peptide binding affinity to major histocompatibility complex (MHC) isoforms. To enable positional interpretation of our predictions, we add indexes to the inputs to track SHAP explanations calculated from the models. Our results demonstrate that positional interpretation of models recapitulates known biochemistry and reveals new biochemistry. This positional SHAP (PoSHAP) conceptual framework provides a foothold for interpretation of other models trained from biological sequences.
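
The idea of indexing inputs by position so that each explanation value stays tied to a (position, residue) pair can be illustrated with a small, self-contained sketch. This is not the authors' code: `toy_model`, the four example peptides, and the all-glycine background are hypothetical stand-ins, and the Shapley values are computed by exact subset enumeration rather than with the SHAP library or an LSTM, which is only feasible here because the peptides are very short.

```python
from collections import defaultdict
from itertools import combinations
from math import factorial


def toy_model(peptide):
    """Hypothetical scoring function standing in for a trained model."""
    score = 0.0
    if peptide[1] == "S":   # assumed preference for serine at position 2
        score += 2.0
    if peptide[-1] == "K":  # assumed preference for lysine at the last position
        score += 1.0
    return score


def shapley_values(model, peptide, background):
    """Exact Shapley value for each position of one peptide.

    Positions outside the coalition take the residue found at the same
    position in a single background peptide (an assumption of this sketch).
    """
    n = len(peptide)

    def value(subset):
        mixed = [peptide[i] if i in subset else background[i] for i in range(n)]
        return model("".join(mixed))

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for combo in combinations(others, k):
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def positional_table(model, peptides, background):
    """Mean attribution keyed by (1-based position, residue)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pep in peptides:
        values = shapley_values(model, pep, background)
        for pos, (aa, phi) in enumerate(zip(pep, values), start=1):
            sums[(pos, aa)] += phi
            counts[(pos, aa)] += 1
    return {key: sums[key] / counts[key] for key in sums}


peptides = ["ASGK", "GSAA", "AAGK", "GGGG"]  # made-up example peptides
table = positional_table(toy_model, peptides, background="GGGG")
for (pos, aa), mean_phi in sorted(table.items()):
    print(pos, aa, round(mean_phi, 3))
```

Because the table is keyed by position as well as residue, the same amino acid can receive different mean attributions at different positions, which is the information a position-agnostic summary would lose. For a real model, the per-position values would come from a SHAP explainer instead of the enumeration used here.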


                Author and article information

                Contributors
                Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing
                Roles: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing
                Role: Editor
                Journal
                PLoS Comput Biol
                PLoS Computational Biology
                Public Library of Science (San Francisco, CA, USA)
                1553-734X
                1553-7358
                28 January 2022
                January 2022
                Volume: 18
                Issue: 1
                Article number: e1009736
                Affiliations
                [001] Department of Biochemistry, Medical College of Wisconsin, Milwaukee, Wisconsin
                CANADA
                Author notes

                The authors have declared that no competing interests exist.

                Author information
                https://orcid.org/0000-0002-7744-3083
                https://orcid.org/0000-0003-2753-3926
                Article
                PCOMPBIOL-D-21-01694
                10.1371/journal.pcbi.1009736
                8797255
                35089914
                a09b5b60-b558-4b08-923c-144efd003e54
                © 2022 Dickinson, Meyer

                This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

                History
                Received: 16 September 2021
                Accepted: 9 December 2021
                Page count
                Figures: 8, Tables: 0, Pages: 24
                Funding
                Funded by: National Institute of General Medical Sciences (funder ID: http://dx.doi.org/10.13039/100000057)
                Award ID: R35GM142502
                Funded by: U.S. National Library of Medicine (funder ID: http://dx.doi.org/10.13039/100000092)
                Award ID: T15LM007359
                This work was supported by the National Institutes of Health (NIH), including the National Institute of General Medical Sciences (NIGMS, https://www.nigms.nih.gov/) award number R35 GM142502 to JGM, and the National Library of Medicine (NLM, https://www.nlm.nih.gov/) training grant award number T15 LM007359 to JGM (PI: Mark Craven). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
                Categories
                Research Article
                Biology and Life Sciences > Molecular Biology > Molecular Biology Techniques > Molecular Biology Assays and Analysis Techniques > Amino Acid Analysis
                Research and Analysis Methods > Molecular Biology Techniques > Molecular Biology Assays and Analysis Techniques > Amino Acid Analysis
                Computer and Information Sciences > Artificial Intelligence > Machine Learning
                Medicine and Health Sciences > Clinical Medicine > Clinical Immunology > Major Histocompatibility Complex
                Biology and Life Sciences > Immunology > Clinical Immunology > Major Histocompatibility Complex
                Medicine and Health Sciences > Immunology > Clinical Immunology > Major Histocompatibility Complex
                Biology and Life Sciences > Immunology > Major Histocompatibility Complex
                Medicine and Health Sciences > Immunology > Major Histocompatibility Complex
                Computer and Information Sciences > Artificial Intelligence > Machine Learning > Deep Learning
                Physical Sciences > Chemistry > Chemical Compounds > Organic Compounds > Amino Acids > Hydroxyl Amino Acids > Serine
                Physical Sciences > Chemistry > Organic Chemistry > Organic Compounds > Amino Acids > Hydroxyl Amino Acids > Serine
                Biology and Life Sciences > Biochemistry > Proteins > Amino Acids > Hydroxyl Amino Acids > Serine
                Physical Sciences > Chemistry > Chemical Compounds > Organic Compounds > Amino Acids > Hydroxyl Amino Acids > Threonine
                Physical Sciences > Chemistry > Organic Chemistry > Organic Compounds > Amino Acids > Hydroxyl Amino Acids > Threonine
                Biology and Life Sciences > Biochemistry > Proteins > Amino Acids > Hydroxyl Amino Acids > Threonine
                Physical Sciences > Chemistry > Chemical Compounds > Organic Compounds > Amino Acids > Basic Amino Acids > Histidine
                Physical Sciences > Chemistry > Organic Chemistry > Organic Compounds > Amino Acids > Basic Amino Acids > Histidine
                Biology and Life Sciences > Biochemistry > Proteins > Amino Acids > Basic Amino Acids > Histidine
                Physical Sciences > Chemistry > Chemical Compounds > Organic Compounds > Amino Acids > Cyclic Amino Acids > Proline
                Physical Sciences > Chemistry > Organic Chemistry > Organic Compounds > Amino Acids > Cyclic Amino Acids > Proline
                Biology and Life Sciences > Biochemistry > Proteins > Amino Acids > Cyclic Amino Acids > Proline
                Custom metadata
                Data and code are available from https://github.com/jessegmeyerlab/positional-SHAP . The data, including all points used to create the main figures, are available from Zenodo: https://zenodo.org/record/5711162#.YZaK-57MJ6I

                Quantitative & Systems biology