Accurate Diabetes Risk Stratification Using Machine Learning: Role of Missing Value and Outliers

Abstract

Diabetes mellitus is a group of metabolic diseases in which blood sugar levels are too high. About 8.8% of the world's population was diabetic in 2017, and this is projected to reach nearly 10% by 2045. The major challenge is that machine learning-based classifiers perform worse at risk stratification when they are applied to data sets that contain missing values or outliers. Our objective is therefore to develop an optimized and robust machine learning (ML) system under the hypothesis that replacing missing values and outliers with computed medians will improve risk stratification accuracy. The ML-based risk stratification system is designed, optimized and evaluated as follows: features are extracted and optimized using six feature selection techniques (random forest, logistic regression, mutual information, principal component analysis, analysis of variance, and Fisher discriminant ratio) and combined with ten different types of classifiers (linear discriminant analysis, quadratic discriminant analysis, naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, and random forest). The Pima Indian diabetic dataset (768 patients: 268 diabetic and 500 controls) was used. Our results demonstrate that replacing the missing values by group medians and the outliers by median values, and then combining random forest feature selection with random forest classification, yields an accuracy, sensitivity, specificity, positive predictive value, negative predictive value and area under the curve of 92.26%, 95.96%, 79.72%, 91.14%, 91.20%, and 0.93, respectively. This is an improvement of 10% over previously developed techniques published in the literature. The system was validated for its stability and reliability.
The random forest (RF)-based model showed the best performance when outliers were replaced by median values.
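
The two preprocessing steps named in the abstract, group-median replacement of missing values and median replacement of outliers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how outliers are detected, so the 1.5 × IQR fence below is an assumption, as are the function names and the toy glucose values.

```python
from statistics import median, quantiles


def impute_group_median(values, labels):
    """Replace None entries with the median of the observed values in the same class."""
    group_medians = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels) if l == label and v is not None]
        group_medians[label] = median(observed)
    return [group_medians[l] if v is None else v for v, l in zip(values, labels)]


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the overall median.

    The IQR fence is an assumed outlier rule, chosen only for illustration.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    med = median(values)
    return [med if (v < low or v > high) else v for v in values]


# Hypothetical toy feature (glucose-like values) with two missing entries
# and one implausibly large reading; outcome is the class label (1 = diabetic).
glucose = [85, None, 90, 150, None, 160, 155, 600]
outcome = [0, 0, 0, 1, 1, 1, 1, 0]

filled = impute_group_median(glucose, outcome)
cleaned = replace_outliers_with_median(filled)
print(filled)
print(cleaned)
```

Because the group median depends on the class label, a real evaluation would need to compute it on training folds only; the sketch leaves that out for brevity.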

Electronic supplementary material

The online version of this article (10.1007/s10916-018-0940-7) contains supplementary material, which is available to authorized users.

Author and article information

Contributors

Jasjit S. Suri: (916) 749-5628, jasjit.suri@atheropoint.com

Journal

Journal ID (nlm-ta): J Med Syst

Journal ID (iso-abbrev): J Med Syst

Title: Journal of Medical Systems

Publisher: Springer US (New York)

ISSN (Print): 0148-5598

ISSN (Electronic): 1573-689X

Publication date (Electronic): 10 April 2018

Publication date PMC-release: 10 April 2018

Publication date (Print): 2018

Volume: 42

Issue: 5

Electronic Location Identifier: 92

Affiliations

[1] ISNI 0000 0004 0451 7306, GRID grid.412656.2, Department of Statistics, University of Rajshahi, Rajshahi, Bangladesh

[2] The JiVitA Project of Johns Hopkins University, Gaibandha, Bangladesh

[3] GRID grid.443086.d, Department of Computer Science and Engineering, Rajshahi University of Engineering and Technology, Rajshahi, Bangladesh

[4] ISNI 0000 0004 1936 9094, GRID grid.40263.33, Brown University, Providence, RI, USA

[5] ISNI 0000 0001 0441 1219, GRID grid.412118.f, Statistics Discipline, Khulna University, Khulna, Bangladesh

[6] ISNI 0000 0001 2113 1622, GRID grid.266623.5, Department of Bioengineering, University of Louisville, Louisville, KY, USA

[7] Stroke Monitoring and Diagnostic Division, AtheroPoint LLC, Roseville, CA, USA

[8] Knowledge Engineering Center, Global Biomedical Technologies, Roseville, CA, USA

Article

Publisher ID: 940

DOI: 10.1007/s10916-018-0940-7

PMC ID: 5893681

PubMed ID: 29637403

SO-VID: af318f0f-345d-4260-aadc-f6efa5887ec9

License:

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

History

Date received: 13 February 2018

Date revision received: 7 March 2018

Date accepted: 14 March 2018

Custom metadata

ScienceOpen disciplines: Public health

Keywords: diabetes, missing values, outliers, risk stratification, feature selection, machine learning

