Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data

Abstract

Background

In recent years, research in artificial neural networks has resurged under the deep-learning (DL) umbrella and grown extremely popular. Recently reported successes of DL techniques in crowd-sourced QSAR and predictive toxicology competitions have showcased these methods as powerful tools in drug-discovery and toxicology research. The aim of this work was twofold. First, a large number of hyper-parameter configurations were explored to investigate how they affect the performance of deep neural networks (DNNs) and to identify configurations that could act as starting points when tuning DNNs. Second, the performance of DNNs was compared to that of popular methods widely employed in cheminformatics, namely Naïve Bayes (NB), k-nearest neighbor (kNN), random forest (RF) and support vector machines (SVM). Moreover, the robustness of the machine learning methods to different levels of artificially introduced noise was assessed. The open-source Caffe deep-learning framework and modern NVIDIA GPUs were used to carry out this study, allowing a large number of DNN configurations to be explored.
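The abstract does not say how the noise was introduced. As a rough illustration only, the sketch below assumes binary activity labels and models noise as flipping a fixed fraction of them at random; the function name `flip_labels` and the toy labels are hypothetical and not taken from the article.

```python
import random

def flip_labels(labels, noise_level, seed=0):
    """Return a copy of binary labels (0/1) with a fraction flipped.

    Assumption: 'noise' means random label flipping; the article may
    have used a different scheme.
    """
    rng = random.Random(seed)            # fixed seed for reproducibility
    n_flip = round(noise_level * len(labels))
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = 1 - noisy[i]          # 0 -> 1, 1 -> 0
    return noisy

# Hypothetical training labels: 100 compounds, half active.
train_labels = [0, 1] * 50
for level in (0.1, 0.2, 0.3, 0.5):
    noisy = flip_labels(train_labels, level)
    changed = sum(a != b for a, b in zip(train_labels, noisy))
    print(f"noise level {level:.0%}: {changed} labels flipped")
```

Under this assumption, a model would be retrained on each noisy label set and evaluated on clean test labels, so that any drop in performance can be attributed to the corrupted training data.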

Results

We show that feed-forward deep neural networks are capable of achieving strong classification performance and, when optimized, outperform shallow methods across diverse activity classes. The hyper-parameters found to play a critical role are the activation function, dropout regularization, the number of hidden layers and the number of neurons. Compared to the other methods, tuned DNNs performed statistically significantly better, with p value <0.01 based on the Wilcoxon statistical test. On average, the DNNs achieved a Matthews correlation coefficient (MCC) 0.149 units higher than NB, 0.092 higher than kNN, 0.052 higher than SVM with a linear kernel, 0.021 higher than RF and 0.009 higher than SVM with a radial basis function kernel. When exploring robustness to noise, non-linear methods were found to perform well at low levels of noise (lower than or equal to 20%). At higher levels of noise (higher than 30%), however, the Naïve Bayes method was found to perform well, and at the highest level of noise (50%) it even outperformed more sophisticated methods across several datasets.
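The two quantities behind these comparisons, the MCC and a Wilcoxon test on paired scores, can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code: the abstract says only "Wilcoxon statistical test", so the paired signed-rank variant is assumed here, the exact p-value is obtained by brute-force enumeration (practical only for a handful of datasets), and the per-dataset scores are invented for the example.

```python
import math
from itertools import product

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def wilcoxon_signed_rank_exact(a, b):
    """Two-sided exact Wilcoxon signed-rank test for paired scores.

    Returns (W+, p). Zero differences are dropped and tied absolute
    differences share their average rank. Enumerates all 2**n sign
    patterns, so it is only suitable for small n.
    """
    # Round to suppress floating-point noise when detecting ties.
    diffs = [round(x - y, 6) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1       # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Hypothetical MCC scores of two methods on eight datasets.
dnn = [0.71, 0.65, 0.80, 0.58, 0.77, 0.69, 0.83, 0.62]
rf = [0.69, 0.64, 0.76, 0.55, 0.72, 0.63, 0.76, 0.54]
w_plus, p_value = wilcoxon_signed_rank_exact(dnn, rf)
# All eight differences favour the first method, so p = 2 / 2**8.
print(f"W+ = {w_plus}, p = {p_value}")
```

With real data one would typically rely on a statistics library rather than enumeration; the brute-force version is used here only to keep the example self-contained.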

Electronic supplementary material

The online version of this article (doi:10.1186/s13321-017-0226-y) contains supplementary material, which is available to authorized users.

Most cited references 40

Individual Comparisons by Ranking Methods

Frank Wilcoxon (1945)

Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

B W Matthews (1975)

Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure, however within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns are generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structure tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.

Gene selection and classification of microarray data using random forest

Javier Díaz-Uriarte, Sara Alvarez de Andrés (2006)

Background Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection. Results We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy. Conclusion Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

Author and article information

Contributors

Alexios Koutsoukas: alex.kouts@ku.edu

Keith J. Monaghan: k970m044@ku.edu

Xiaoli Li: xiaolili@ittc.ku.edu

Jun Huan: jhuan@ittc.ku.edu

ORCID: http://orcid.org/0000-0003-4929-2617

Journal

Journal ID (nlm-ta): J Cheminform

Journal ID (iso-abbrev): J Cheminform

Title: Journal of Cheminformatics

Publisher: Springer International Publishing (Cham)

ISSN (Electronic): 1758-2946

Publication date (Electronic): 28 June 2017

Publication date PMC-release: 28 June 2017

Publication date Collection: 2017

Volume: 9

Electronic Location Identifier: 42

Affiliations

ISNI 0000 0001 2106 0692, GRID grid.266515.3, Department of Electrical Engineering and Computer Sciences, University of Kansas, Lawrence, KS 66047-7621, USA

Article

Publisher ID: 226

DOI: 10.1186/s13321-017-0226-y

PMC ID: 5489441

PubMed ID: 28316652

SO-VID: d28c8fb7-a8dc-44a4-be92-687b75f60bb5

License:

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

History

Date received: 30 September 2016

Date accepted: 27 May 2017

Funding

Funded by: National Science Foundation (FundRef http://dx.doi.org/10.13039/100000001)

Award ID: CNS 1337899

Award Recipient: Jun Huan

Custom metadata

ScienceOpen disciplines: Chemoinformatics

Keywords: deep learning, sars, cheminformatics, machine-learning, data-mining, random forest, knn, support vector machines, naïve bayes

Related collections

Computer Vision, Deep Learning, Deep Reinforcement Learning, IoT
