Beware of machine learning-based scoring functions-on the danger of developing black boxes.

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

Training machine learning algorithms with protein-ligand descriptors has recently gained considerable attention to predict binding constants from atomic coordinates. Starting from a series of recent reports stating the advantages of this approach over empirical scoring functions, we could indeed reproduce the claimed superiority of Random Forest and Support Vector Machine-based scoring functions to predict experimental binding constants from protein-ligand X-ray structures of the PDBBind dataset. Strikingly, these scoring functions, trained on simple protein-ligand element-element distance counts, were almost unable to enrich virtual screening hit lists in true actives upon docking experiments of 10 reference DUD-E datasets; this is a a feature that, however, has been verified for an a priori less-accurate empirical scoring function (Surflex-Dock). By systematically varying ligand poses from true X-ray coordinates, we show that the Surflex-Dock scoring function is logically sensitive to the quality of docking poses. Conversely, our machine-learning based scoring functions are totally insensitive to docking poses (up to 10 Å root-mean square deviations) and just describe atomic element counts. This report does not disqualify using machine learning algorithms to design scoring functions. Protein-ligand element-element distance counts should however be used with extreme caution and only applied in a meaningful way. To avoid developing novel but meaningless scoring functions, we propose that two additional benchmarking tests must be systematically done when developing novel scoring functions: (i) sensitivity to docking pose accuracy, and (ii) ability to enrich hit lists in true actives upon structure-based (docking, receptor-ligand pharmacophore) virtual screening of reference datasets.

Related collections

Author and article information

Journal

Journal ID (iso-abbrev): J Chem Inf Model

Title: Journal of chemical information and modeling

ISSN (Electronic): 1549-960X

ISSN (Print): 1549-9596

Publication date (Electronic): Oct 27 2014

Volume: 54

Issue: 10

Affiliations

[1 ] Laboratoire d'Innovation Thérapeutique, UMR 7200 CNRS-Université de Strasbourg , 74 route du Rhin, F-67400 Illkirch, France.

Article

DOI: 10.1021/ci500406k

PubMed ID: 25207678

SO-VID: 9de328f1-e470-4f24-ada9-7ed8d8f17b52

History

Data availability:

Comments

Comment on this article

scite_

Smart Citations

Citing PublicationsSupportingMentioningContrasting

View Citations

See how this article has been cited at scite.ai

scite shows how a scientific paper has been cited by providing the context of the citation, a classification describing whether it supports, mentions, or contrasts the cited claim, and a label indicating in which section the citation was made.