The accurate prediction of the solubility of drugs is still problematic. For a long time it was thought that the shortfalls were due to the lack of high-quality solubility data covering the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurements were discussed, and suggestions were offered for extracting more reliable information from legacy data. Many of those suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving the intrinsic solubility, S0) and by normalizing temperature (transforming measurements performed in the range 10-50 °C to 25 °C), the average interlaboratory reproducibility can now be estimated at 0.17 log unit. At best, empirical methods to predict solubility have hovered around a root-mean-square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky's general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR), a statistical machine-learning method. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of the data is still out of reach. Data quality is not the limiting factor in prediction, and statistical machine-learning methodologies are probably up to the task. Possibly what is missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds).
Also, new descriptors that better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
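Two of the operations named above have simple closed forms: Yalkowsky's GSE, log S0 = 0.5 − 0.01(MP − 25) − log P (melting-point term dropped for compounds liquid at 25 °C), and the ionization correction that recovers the intrinsic solubility S0 from a measurement at a given pH. The sketch below illustrates both for the special case of a monoprotic acid via the Henderson-Hasselbalch relation; the function names and the choice of a monoprotic acid are illustrative assumptions, not the paper's full correction scheme, which handles more general ionization cases.

```python
import math

def gse_log_s0(log_p: float, mp_celsius: float) -> float:
    """Yalkowsky's general solubility equation (GSE):
    log S0 = 0.5 - 0.01*(MP - 25) - log P,
    with S0 the intrinsic solubility (mol/L), MP the melting point (deg C,
    term omitted if MP <= 25), and log P the octanol-water partition coefficient."""
    mp_term = 0.01 * (mp_celsius - 25.0) if mp_celsius > 25.0 else 0.0
    return 0.5 - mp_term - log_p

def log_s0_from_measured(log_s_measured: float, ph: float, pka: float) -> float:
    """Ionization correction for a monoprotic acid (illustrative case only):
    log S = log S0 + log10(1 + 10**(pH - pKa)), rearranged for log S0."""
    return log_s_measured - math.log10(1.0 + 10.0 ** (ph - pka))

# Hypothetical inputs: log P = 3.0, MP = 125 deg C
print(gse_log_s0(3.0, 125.0))          # GSE estimate of log S0
# Hypothetical measurement: log S = -4.0 at pH 7.0 for an acid with pKa 7.0
print(log_s0_from_measured(-4.0, 7.0, 7.0))
```

Note that the GSE needs no training data at all, which is why it serves as the baseline against which the database-trained ABSOLV and RFR models are judged.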