Accounting for multiple imputation-induced variability for differential analysis in mass spectrometry-based label-free quantitative proteomics

Abstract

Imputing missing values is common practice in label-free quantitative proteomics. Imputation replaces a missing value with a user-defined one. However, downstream analyses often treat imputed datasets as if they had always been complete, so the uncertainty due to the imputation is not adequately taken into account. We provide a rigorous multiple imputation strategy that, thanks to Rubin’s rules, leads to a less biased estimate of the parameters’ variability. The resulting imputation-aware variance estimator of the peptide intensities is then moderated using Bayesian hierarchical models and included in moderated t-test statistics to provide differential analysis results. This workflow can be used at both the peptide and protein levels in quantification datasets: an aggregation step is included to obtain protein-level results from peptide-level quantification data. Our methodology, named mi4p, was compared to the state-of-the-art limma workflow implemented in the DAPAR R package, on both simulated and real datasets. We observed a trade-off between sensitivity and specificity, while mi4p outperformed DAPAR overall in terms of F-score.
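As an illustration of the pooling step, the following minimal Python sketch applies the standard form of Rubin's rules to a single parameter. It is not the mi4p implementation: the function name `pool_rubin` and the numeric values are hypothetical, and the sketch assumes that each of the D imputed datasets has already produced one point estimate and the variance of that estimate.

```python
from statistics import mean, variance


def pool_rubin(estimates, within_variances):
    """Pool D per-imputation estimates and their variances with Rubin's rules."""
    d = len(estimates)
    if d < 2 or d != len(within_variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = mean(estimates)                # pooled point estimate
    within = mean(within_variances)         # average within-imputation variance
    between = variance(estimates)           # between-imputation variance (D - 1 denominator)
    total = within + (1 + 1 / d) * between  # total variance of the pooled estimate
    return pooled, within, between, total


# Hypothetical values: mean log-intensity of one peptide in one condition,
# estimated on D = 3 imputed datasets, with the variance of each estimate.
estimates = [20.0, 21.0, 22.0]
within_variances = [0.5, 0.5, 0.5]
pooled, within, between, total = pool_rubin(estimates, within_variances)
```

The between-imputation term is what a single-imputation analysis leaves out: when the imputed datasets disagree, the total variance grows accordingly.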

Author summary

Statistical inference methods commonly used in quantitative proteomics are based on the measurement of peptide intensities. They allow protein abundances to be deduced, provided that sufficient peptides per protein are available. However, they do not satisfactorily handle peptides or proteins whose intensities are missing under certain conditions, even though these are particularly interesting from a biological or medical point of view, since they may explain a difference between the groups being compared. Some state-of-the-art statistical software for proteomics data processing imputes these missing values, while other tools simply remove proteins with too many missing peptides. The statistical treatment is not entirely satisfactory when imputation methods are used, notably multiple imputation techniques. Indeed, even if these statistical tools are relevant in this context, the imputed datasets are treated in subsequent analyses as if they had always been complete: the uncertainty caused by the imputation is not taken into account. These analyses generally conclude with a study of the differences in protein abundances between conditions, using either Student’s or Welch’s test for the most rudimentary approaches, or moderated t-testing techniques based on empirical Bayesian approaches. Thus, we propose a new methodology that starts by imputing missing values at the peptide level and estimating the uncertainty associated with this imputation, and then naturally extends to incorporate this uncertainty into current moderated variance estimation techniques.
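To make the idea of variance moderation concrete, here is a small Python sketch of the generic empirical Bayes shrinkage used in limma-style moderated t-tests. It is an illustration under stated assumptions, not the mi4p or DAPAR code: the function names and numbers are hypothetical, and the prior variance and prior degrees of freedom are taken as given, whereas in practice they are estimated from all peptides or proteins.

```python
import math


def moderated_variance(s2, df, s2_prior, df_prior):
    """Shrink an observed variance towards a prior variance.

    The result is a weighted average of the two variances, with the
    degrees of freedom acting as weights.
    """
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


def moderated_t(diff, s2_mod, n1, n2):
    """Two-group moderated t statistic built on the shrunken variance."""
    return diff / math.sqrt(s2_mod * (1 / n1 + 1 / n2))


# Hypothetical peptide: log-intensity difference of 1.0 between two
# conditions with 3 replicates each (4 residual degrees of freedom).
s2_mod = moderated_variance(s2=0.04, df=4, s2_prior=0.10, df_prior=2)
t_mod = moderated_t(diff=1.0, s2_mod=s2_mod, n1=3, n2=3)
```

In this sketch, an imputation-aware variance estimate would simply take the place of `s2`, so that the uncertainty from imputation carries through to the test statistic.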

Author and article information

Contributors

Marie Chion:

ORCID: https://orcid.org/0000-0001-8956-8388

Roles: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

Christine Carapito:

ORCID: https://orcid.org/0000-0002-0079-319X

Roles: Conceptualization, Funding acquisition, Project administration, Supervision, Validation, Writing – original draft, Writing – review & editing

Frédéric Bertrand:

ORCID: https://orcid.org/0000-0002-0837-8281

Roles: Conceptualization, Funding acquisition, Methodology, Project administration, Software, Supervision, Writing – original draft, Writing – review & editing

Wout Bittremieux: Role: Editor

Journal

Journal ID (nlm-ta): PLoS Comput Biol

Journal ID (iso-abbrev): PLoS Comput Biol

Journal ID (publisher-id): plos

Title: PLoS Computational Biology

Publisher: Public Library of Science (San Francisco, CA, USA)

ISSN (Print): 1553-734X

ISSN (Electronic): 1553-7358

Publication date (Collection): August 2022

Publication date (Electronic): 29 August 2022

Volume: 18

Issue: 8

Electronic Location Identifier: e1010420

Affiliations

[1] Institut de Recherche Mathématique Avancée, UMR 7501, CNRS-Université de Strasbourg, Strasbourg, France

[2] Laboratoire de Spectrométrie de Masse Bio-Organique, Institut Pluridisciplinaire Hubert Curien, UMR 7178, CNRS-Université de Strasbourg, Strasbourg, France

[3] Laboratoire Mathématiques appliquées à Paris 5, UMR 8145, CNRS-Université Paris Cité, Paris, France

[4] Infrastructure Nationale de Protéomique ProFi - FR2048, 67087 Strasbourg, France

[5] Laboratoire Informatique et Société Numérique, Université de Technologie de Troyes, Troyes, France

University of California San Diego, UNITED STATES

Author notes

The authors have declared that no competing interests exist.

* E-mail: marie.chion@protonmail.com

Article

Publisher ID: PCOMPBIOL-D-22-00385

DOI: 10.1371/journal.pcbi.1010420

PMC ID: 9462777

PubMed ID: 36037245

SO-VID: ebfe04f7-0272-4d54-8033-aa431be32e58

License:

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

History

Date received: 11 March 2022

Date accepted: 21 July 2022

Page count

Figures: 11, Tables: 6, Pages: 26

Funding

Funded by: Agence Nationale de la Recherche (FR)

Award ID: ANR-11-LABX-0055_IRMIA

Award Recipient: Marie Chion (ORCID: https://orcid.org/0000-0001-8956-8388)

This work was funded through a PhD grant (2018-2021) awarded to MC and received by FB and CC from the Agence Nationale de la Recherche (ANR) through the Labex IRMIA [ANR-11-LABX-0055 IRMIA]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Custom metadata

PLOS Publication Stage: vor-update-to-uncorrected-proof

Publication Update: 2022-09-09

Data Availability: Our mi4p algorithm is implemented in R in the mi4p package, which is publicly available on CRAN. The development version, as well as the R scripts that led to the results presented, can also be found in a GitHub repository (https://github.com/mariechion/mi4p). The spiked yeast dataset and the Arabidopsis thaliana spiked dataset are public and accessible on the ProteomeXchange website using the identifiers PXD003841 and PXD027800.
