Pushing the limits of solubility prediction via quality-oriented data selection

Summary

Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The scarcity of high-quality solubility data is recognized as the biggest hurdle in developing robust data-driven methods for practical use. Nonetheless, the effects of data quality and quantity on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of data set size and quality in the performance of solubility prediction models are unraveled, and the concepts of actual and observed performance are introduced. To narrow the gap between actual and observed performance, a quality-oriented data selection method is designed, which evaluates the quality of data and extracts the most accurate part of it through statistical validation. Applying this method to the largest publicly available solubility database and using a consensus machine learning approach yields a top-performing solubility prediction model.
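The summary does not spell out the statistical validation step, so the following is only a minimal sketch of what a quality-oriented filter could look like, not the authors' procedure. It assumes each compound has one or more reported logS values and keeps only compounds whose replicate measurements agree closely. The function name `select_consistent`, the thresholds `max_sd` and `min_count`, and the example values are all hypothetical.

```python
from statistics import mean, stdev


def select_consistent(measurements, max_sd=0.5, min_count=2):
    """Keep compounds whose replicate logS values agree within max_sd.

    measurements: dict mapping a compound identifier to a list of logS values.
    Returns a dict mapping each retained compound to its mean logS.
    Both thresholds are hypothetical illustration values.
    """
    selected = {}
    for compound, values in measurements.items():
        if len(values) < min_count:
            continue  # a single measurement gives no estimate of spread
        if stdev(values) <= max_sd:
            selected[compound] = mean(values)
    return selected


# Made-up example values, for illustration only.
data = {
    "A": [-2.10, -2.05, -2.20],  # replicates agree closely
    "B": [-4.00, -1.50],         # replicates disagree strongly
    "C": [-3.30],                # only one measurement
}
curated = select_consistent(data)
```

In this toy input only compound "A" survives the filter. A stricter `max_sd` trades data set size for label accuracy, which is the quality-versus-quantity trade-off the summary refers to.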

Highlights

• Consensus machine learning models perform better than singular models

• Quality-oriented data selection yields better results than using all data

• The uncertainty of test data determines the theoretical limit of a model's performance

• The concepts of actual and observed performances of solubility models are introduced
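The distinction between actual and observed performance, and the limit set by test-data uncertainty, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the article's analysis: errors are taken to be Gaussian and independent, the three models are idealized as having uncorrelated errors (real models usually do not), and the values of `model_sd` and `label_sd` are hypothetical.

```python
import math
import random


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
model_sd = 0.5  # hypothetical error of each model against the true values
label_sd = 0.6  # hypothetical experimental error in the test labels

# Synthetic "true" logS values and the noisy labels a test set would contain.
true_values = [rng.uniform(-8.0, 2.0) for _ in range(n)]
noisy_labels = [t + rng.gauss(0.0, label_sd) for t in true_values]

# Three hypothetical models with independent errors (an idealization).
models = [[t + rng.gauss(0.0, model_sd) for t in true_values] for _ in range(3)]
# Consensus prediction: the plain average of the three models per compound.
consensus = [sum(preds) / len(preds) for preds in zip(*models)]

single_actual = rmse(models[0], true_values)       # error against the truth
single_observed = rmse(models[0], noisy_labels)    # error against noisy labels
consensus_actual = rmse(consensus, true_values)
consensus_observed = rmse(consensus, noisy_labels)
label_floor = rmse(true_values, noisy_labels)      # score of a perfect model
```

Under these assumptions the observed RMSE is approximately the square root of the squared actual error plus `label_sd` squared, so it always exceeds the actual RMSE. Even a perfect model would score near `label_sd` against the noisy labels. Averaging lowers the consensus error relative to a single model, but the observed score still cannot drop below the label noise floor.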

Subject areas

Chemistry; Analytical Reagents; Computational Chemistry; Artificial Intelligence

Author and article information

Contributors

Süleyman Er

Journal

Journal ID (nlm-ta): iScience

Journal ID (iso-abbrev): iScience

Title: iScience

Publisher: Elsevier

ISSN (Electronic): 2589-0042

Publication date PMC-release: 17 December 2020

Publication date Collection: 22 January 2021

Publication date (Electronic): 17 December 2020

Volume: 24

Issue: 1

Electronic Location Identifier: 101961

Affiliations

[1] DIFFER - Dutch Institute for Fundamental Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands

[2] CCER - Center for Computational Energy Research, De Zaale 20, 5612 AJ Eindhoven, the Netherlands

[3] Department of Applied Physics, Eindhoven University of Technology, 5600 MB Eindhoven, the Netherlands

Author notes

[∗] Corresponding author: s.er@differ.nl

[4] Lead contact

Article

Publisher Item ID: S2589-0042(20)31158-5 Publisher ID: 101961

DOI: 10.1016/j.isci.2020.101961

PMC ID: 7788089

PubMed ID: 33437941

SO-VID: 3327adf2-c32d-4dc9-bd3c-af0d10780ca1

License:

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

History

Date received: 16 October 2020

Date revision received: 18 November 2020

Date accepted: 15 December 2020
