      Is Open Access

      Prediction of aqueous intrinsic solubility of druglike molecules using Random Forest regression trained with Wiki-pS0 database

      research-article


          Abstract

          The accurate prediction of the solubility of drugs is still problematic. It was thought for a long time that the shortfalls were due to the lack of high-quality solubility data from the chemical space of drugs. This study considers the quality of solubility data, particularly of ionizable drugs. A database is described, comprising 6355 entries of intrinsic solubility for 3014 different molecules, drawing on 1325 citations. In an earlier publication, many factors affecting the quality of the measurement had been discussed, and suggestions were offered to improve ways of extracting more reliable information from legacy data. Many of the suggestions have been implemented in this study. By correcting solubility for ionization (i.e., deriving intrinsic solubility, S0) and by normalizing temperature (by transforming measurements performed in the range 10-50 °C to 25 °C), it can now be estimated that the average interlaboratory reproducibility is 0.17 log unit. Empirical methods to predict solubility have at best hovered around a root mean square error (RMSE) of 0.6 log unit. Three prediction methods are compared here: (a) Yalkowsky’s general solubility equation (GSE), (b) the Abraham solvation equation (ABSOLV), and (c) Random Forest regression (RFR) statistical machine learning. The latter two methods were trained using the new database. The RFR method outperforms the other two models, as anticipated. However, the ability to predict the solubility of drugs to the level of the quality of data is still out of reach. The data quality is not the limiting factor in prediction. The statistical machine learning methodologies are probably up to the task. Possibly what’s missing are solubility data from a few sparsely covered regions of the chemical space of drugs (particularly of research compounds). Also, new descriptors that can better differentiate the factors affecting solubility between molecules could be critical for narrowing the gap between the accuracy of the prediction models and that of the experimental data.
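
Two quantities named in the abstract lend themselves to a short worked sketch: the ionization correction that yields intrinsic solubility, and the GSE baseline scored by RMSE. The Python below is illustrative only. It assumes a monoprotic acid or base that follows the Henderson-Hasselbalch relation, and the commonly quoted form of the GSE, log S0 = 0.5 - 0.01 (mp - 25) - log P. The function names and example numbers are hypothetical and are not taken from the article or from the Wiki-pS0 database, whose own correction and temperature-normalization procedures are more elaborate.

```python
import math


def log_s0_from_log_s(log_s, ph, pka, kind):
    """Estimate intrinsic log S0 from log S measured at a given pH.

    Assumption: a single ionizable group obeying Henderson-Hasselbalch,
    with no salt precipitation, aggregation, or other complications.
    """
    if kind == "acid":
        ionized_ratio = 10 ** (ph - pka)   # [A-] / [HA]
    elif kind == "base":
        ionized_ratio = 10 ** (pka - ph)   # [BH+] / [B]
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    # Total solubility S = S0 * (1 + ionized_ratio), so subtract the log term.
    return log_s - math.log10(1.0 + ionized_ratio)


def gse_log_s0(melting_point_c, log_p):
    """General solubility equation: log S0 = 0.5 - 0.01 (mp - 25) - log P.

    The melting-point term is set to zero for compounds that melt below 25 °C.
    """
    mp_term = max(melting_point_c - 25.0, 0.0)
    return 0.5 - 0.01 * mp_term - log_p


def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical example: an acid with pKa 4.0 measured at pH 5.0.
example_log_s0 = log_s0_from_log_s(log_s=-2.0, ph=5.0, pka=4.0, kind="acid")
example_gse = gse_log_s0(melting_point_c=125.0, log_p=2.0)
example_rmse = rmse([example_gse], [example_log_s0])
```

In this made-up case the compound is roughly ten times more soluble at pH 5 than its neutral form alone, so the intrinsic value sits about one log unit below the measured one. That is the kind of shift that goes uncorrected when legacy data for ionizable drugs are used as reported.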


          Most cited references (188)


          Random forest: a classification and regression tool for compound classification and QSAR modeling.

          A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
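
The ensemble procedure described in this abstract (bootstrap samples, random feature selection at each split, unpruned regression trees, averaging of tree predictions) can be sketched in plain Python. This is a minimal teaching sketch, not the implementation used in either article. All function names, the toy data, and the default parameters are hypothetical, and drawing candidate features only from those that still vary within a node is a simplification assumed here.

```python
import random
from statistics import mean


def _sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def _best_split(rows, targets, feature_ids):
    """Return the (feature, threshold) pair with the lowest total squared error."""
    best, best_score = None, float("inf")
    for f in feature_ids:
        distinct = sorted({r[f] for r in rows})
        # Thresholds are all observed values except the largest,
        # so both sides of every split are non-empty.
        for t in distinct[:-1]:
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            score = _sse(left) + _sse(right)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def _grow(rows, targets, n_try, rng):
    """Grow one unpruned regression tree; leaves are floats, nodes are tuples."""
    varying = [f for f in range(len(rows[0])) if len({r[f] for r in rows}) > 1]
    if len(set(targets)) == 1 or not varying:
        return float(mean(targets))
    # Random feature selection at this node.
    chosen = rng.sample(varying, min(n_try, len(varying)))
    f, t = _best_split(rows, targets, chosen)
    left = [(r, y) for r, y in zip(rows, targets) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[f] > t]
    return (f, t,
            _grow([r for r, _ in left], [y for _, y in left], n_try, rng),
            _grow([r for r, _ in right], [y for _, y in right], n_try, rng))


def _predict_tree(node, row):
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(rows, targets, n_trees=50, n_try=1, seed=0):
    """Fit n_trees trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(_grow([rows[i] for i in idx],
                            [targets[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees (regression case)."""
    return mean(_predict_tree(tree, row) for tree in forest)


# Toy data: the target depends on the first descriptor only.
toy_rows = [[i, i % 3] for i in range(20)]
toy_targets = [2.0 * i for i in range(20)]
toy_forest = fit_forest(toy_rows, toy_targets)
low = predict_forest(toy_forest, [2, 2])
high = predict_forest(toy_forest, [17, 2])
```

Because each leaf stores a mean of training targets, a forest prediction can never fall outside the range of the training data. This is one reason why sparsely covered regions of chemical space are hard for this kind of model.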

            Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings.

            Experimental and computational approaches to estimate solubility and permeability in discovery and development settings are described. In the discovery setting 'the rule of 5' predicts that poor absorption or permeation is more likely when there are more than 5 H-bond donors, 10 H-bond acceptors, the molecular weight (MWT) is greater than 500 and the calculated Log P (CLogP) is greater than 5 (or MLogP > 4.15). Computational methodology for the rule-based Moriguchi Log P (MLogP) calculation is described. Turbidimetric solubility measurement is described and applied to known drugs. High throughput screening (HTS) leads tend to have higher MWT and Log P and lower turbidimetric solubility than leads in the pre-HTS era. In the development setting, solubility calculations focus on exact value prediction and are difficult because of polymorphism. Recent work on linear free energy relationships and Log P approaches is critically reviewed. Useful predictions are possible in closely related analog series when coupled with experimental thermodynamic solubility measurements.
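
The four thresholds of 'the rule of 5' quoted above translate directly into a small check. The sketch below simply counts how many thresholds a compound exceeds. The function name, the choice to return a count, and the example descriptor values are assumptions made for illustration, not part of the cited work.

```python
def rule_of_five_violations(h_donors, h_acceptors, mol_weight, clogp):
    """Count how many 'rule of 5' thresholds a compound exceeds.

    Thresholds as quoted in the abstract: more than 5 H-bond donors,
    more than 10 H-bond acceptors, MWT above 500, CLogP above 5.
    """
    checks = [
        h_donors > 5,
        h_acceptors > 10,
        mol_weight > 500,
        clogp > 5,
    ]
    return sum(checks)


# Hypothetical descriptor values for two made-up compounds.
small_polar = rule_of_five_violations(h_donors=2, h_acceptors=5,
                                      mol_weight=350.0, clogp=3.1)
large_greasy = rule_of_five_violations(h_donors=6, h_acceptors=12,
                                       mol_weight=650.0, clogp=6.2)
```

Values that sit exactly on a threshold (for example, MWT of 500) do not count as violations here, which follows the "more than" and "greater than" wording of the abstract.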

              Drug-like properties and the causes of poor solubility and poor permeability.

              C Lipinski (2001)
              There are currently about 10000 drug-like compounds. These are sparsely, rather than uniformly, distributed through chemistry space. True diversity does not exist in experimental combinatorial chemistry screening libraries. Absorption, distribution, metabolism, and excretion (ADME) and chemical reactivity-related toxicity is low, while biological receptor activity is higher dimensional in chemistry space, and this is partly explainable by evolutionary pressures on ADME to deal with endobiotics and exobiotics. ADME is hard to predict for large data sets because current ADME experimental screens are multi-mechanisms, and predictions get worse as more data accumulates. Currently, screening for biological receptor activity precedes or is concurrent with screening for properties related to "drugability." In the future, "drugability" screening may precede biological receptor activity screening. The level of permeability or solubility needed for oral absorption is related to potency. The relative importance of poor solubility and poor permeability towards the problem of poor oral absorption depends on the research approach used for lead generation. A "rational drug design" approach as exemplified by Merck advanced clinical candidates leads to time-dependent higher molecular weight, higher H-bonding properties, unchanged lipophilicity, and, hence, poorer permeability. A high throughput screening (HTS)-based approach as exemplified by unpublished data on Pfizer (Groton, CT) early candidates leads to higher molecular weight, unchanged H-bonding properties, higher lipophilicity, and, hence, poorer aqueous solubility.

                Author and article information

                Journal
                ADMET & DMPK
                International Association of Physical Chemists
                ISSN: 1848-7718
                Published: 04 March 2020
                Volume 8, Issue 1, Pages 29-77
                Affiliations
                [1] in-ADME Research, 1732 First Avenue #102, New York, NY 10128, USA
                Author notes
                *Corresponding Author: E-mail: alex@in-adme.com; Tel: +1-646-678-5713
                Article
                DOI: 10.5599/admet.766
                8915599
                35299775
                1a5a57a9-a8a0-42ee-82cc-dfc23980ced7
                Copyright © 2020 by the authors.

                This article is an open-access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

                History
                : 09 December 2019
                : 17 February 2020
                Page count
                Figures: 24, Tables: 6, Equations: 4, References: 196, Pages: 49
                Categories
                Original Scientific Papers

                aqueous intrinsic solubility, druglike, interlaboratory experimental error, pDISOL-X, general solubility equation (GSE), Abraham solvation equation (ABSOLV), multiple linear regression (MLR), random forest regression (RFR), quantitative structure-property relationships (QSPR)
