Training machine learning algorithms on protein-ligand descriptors has recently gained considerable attention as a way to predict binding constants from atomic coordinates. Starting from a series of recent reports stating the advantages of this approach over empirical scoring functions, we could indeed reproduce the claimed superiority of Random Forest and Support Vector Machine-based scoring functions in predicting experimental binding constants from protein-ligand X-ray structures of the PDBBind dataset. Strikingly, these scoring functions, trained on simple protein-ligand element-element distance counts, were almost unable to enrich virtual screening hit lists in true actives when 10 reference DUD-E datasets were docked; an a priori less accurate empirical scoring function (Surflex-Dock), however, did achieve such enrichment. By systematically varying ligand poses from the true X-ray coordinates, we show that the Surflex-Dock scoring function is, as expected, sensitive to the quality of docking poses. Conversely, our machine-learning-based scoring functions are totally insensitive to docking poses (up to 10 Å root-mean-square deviation) and merely describe atomic element counts. This report does not disqualify the use of machine learning algorithms to design scoring functions. Protein-ligand element-element distance counts should, however, be used with extreme caution and applied only in a meaningful way. To avoid developing novel but meaningless scoring functions, we propose that two additional benchmarking tests be done systematically when developing novel scoring functions: (i) sensitivity to docking pose accuracy, and (ii) ability to enrich hit lists in true actives upon structure-based (docking, receptor-ligand pharmacophore) virtual screening of reference datasets.
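The kind of descriptor discussed above, and the proposed pose-sensitivity test, can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the procedure used in this work: the coordinates, the function names `pair_counts` and `translate`, and both cutoff values are hypothetical, since the abstract gives no distance threshold. The sketch counts protein-ligand atom pairs per element pair within a cutoff, then displaces the ligand and recounts. With a generous cutoff the counts may not change at all, which is one possible way such a descriptor can end up reflecting element counts rather than pose.

```python
import math
from collections import Counter

# Hypothetical cutoffs in Å; neither value is taken from the source.
LOOSE_CUTOFF = 12.0
TIGHT_CUTOFF = 3.0


def pair_counts(protein, ligand, cutoff):
    """Count protein-ligand atom pairs within `cutoff`.

    Each atom is (element, (x, y, z)). The result is keyed by
    (protein element, ligand element).
    """
    counts = Counter()
    for p_elem, p_xyz in protein:
        for l_elem, l_xyz in ligand:
            if math.dist(p_xyz, l_xyz) <= cutoff:
                counts[(p_elem, l_elem)] += 1
    return counts


def translate(atoms, shift):
    """Return a copy of `atoms` rigidly displaced by `shift`."""
    return [(e, tuple(c + s for c, s in zip(xyz, shift))) for e, xyz in atoms]


# Toy coordinates, invented for illustration only.
protein = [("C", (0.0, 0.0, 0.0)), ("N", (3.0, 0.0, 0.0)), ("O", (0.0, 4.0, 0.0))]
ligand = [("C", (1.0, 1.0, 0.0)), ("O", (2.0, 2.0, 0.0))]
moved = translate(ligand, (2.0, 0.0, 0.0))  # a 2 Å rigid displacement

# Loose cutoff: every pair is counted in both poses.
native = pair_counts(protein, ligand, LOOSE_CUTOFF)
shifted = pair_counts(protein, moved, LOOSE_CUTOFF)

# Tight cutoff: the same displacement changes the counts.
tight_native = pair_counts(protein, ligand, TIGHT_CUTOFF)
tight_shifted = pair_counts(protein, moved, TIGHT_CUTOFF)

print("loose cutoff, pose-sensitive:", native != shifted)
print("tight cutoff, pose-sensitive:", tight_native != tight_shifted)
```

In this toy case the loose-cutoff descriptor is identical for both poses, while the tight-cutoff descriptor differs. A real sensitivity test would perturb poses over a range of deviations and compare predicted scores, not raw counts.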