Identification of clinical factors related to prediction of alcohol use disorder from electronic health records using feature selection methods

Abstract

Background

High dimensionality in electronic health records (EHRs) poses a significant computational problem for any systematic search for predictive, diagnostic, or prognostic patterns. Feature selection (FS) methods have been reported to be effective both in reducing the number of features and in identifying risk factors related to the prediction of clinical disorders. This paper examines the prediction of patients with alcohol use disorder (AUD) using machine learning (ML) and attempts to identify risk factors related to the diagnosis of AUD.

Methods

The FS framework consists of two operational levels: base selectors and ensemble selectors. The first level consists of five FS methods: three filter methods, one wrapper method, and one embedded method. The outputs of the base selectors are aggregated to build four ensemble FS methods. The outputs of the FS methods were then fed into three ML algorithms, support vector machine (SVM), K-nearest neighbor (KNN), and random forest (RF), to compare and identify the best feature subset for the prediction of AUD from EHRs.
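
The two-level idea can be illustrated with a minimal sketch, assuming scikit-learn and synthetic data rather than the study's EHR data. The three filter scorers (ANOVA F, chi-squared, mutual information), the cutoff K, and the RF settings are illustrative assumptions, since the abstract does not name the specific methods or parameters. The sketch covers only a union-of-filters ensemble feeding one classifier.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for an EHR feature matrix (assumption: not the study data).
X, y = make_classification(
    n_samples=400, n_features=40, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# chi2 needs non-negative inputs, so rescale using statistics of the training split only.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

K = 10  # hypothetical number of features kept by each base selector

# Level 1: base selectors (three filter scorers chosen for illustration).
base_scorers = {
    "anova_f": f_classif,
    "chi2": chi2,
    "mutual_info": partial(mutual_info_classif, random_state=0),
}
masks = {}
for name, scorer in base_scorers.items():
    selector = SelectKBest(score_func=scorer, k=K).fit(X_train_s, y_train)
    masks[name] = selector.get_support()  # boolean mask over the 40 features

# Level 2: ensemble selector, here the union of the base selections.
union_mask = np.logical_or.reduce(list(masks.values()))
selected = np.flatnonzero(union_mask)

# Feed the selected subset into one of the classifiers (random forest).
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train_s[:, selected], y_train)
test_f1 = f1_score(y_test, clf.predict(X_test_s[:, selected]))

print(f"union keeps {selected.size} of {X.shape[1]} features; test F1 = {test_f1:.2f}")
```

The selectors are fitted on the training split only, so the held-out score is not influenced by the test labels. A union keeps every feature that any base selector ranks highly, so it reduces dimensionality less aggressively than an intersection would.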

Results

In terms of feature reduction, the embedded FS method significantly reduced the number of features, from 361 to 131. In terms of classification performance, RF trained on the 272 features selected by our proposed ensemble method (Union FS) achieved the highest accuracy in predicting patients with AUD (96%) and outperformed all other models in terms of AUROC, AUPRC, Precision, Recall, and F1-Score. Considering the limitations of embedded and wrapper methods, the best overall performance was achieved by our proposed Union Filter FS, which reduced the number of features to 223 and improved Precision, Recall, and F1-Score in RF from 0.77, 0.65, and 0.71 to 0.87, 0.81, and 0.84, respectively. Our findings indicate that, besides gender, age, and length of hospital stay, diagnoses related to the digestive organs, bones, muscles and connective tissue, and the nervous system are important clinical factors related to the prediction of patients with AUD.
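
As a quick consistency check on the reported metrics, the short sketch below recomputes F1 as the harmonic mean of Precision and Recall, the standard definition, which is assumed here to be the one used. Because the published Precision and Recall are rounded to two decimals, the recomputed values (about 0.705 and 0.839) can only be expected to agree with the reported 0.71 and 0.84 to within roughly 0.01. The labels "before" and "after" are placeholders for the two RF configurations being compared.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, reported F1-Score) for RF, as given in the Results text.
reported = {
    "before": (0.77, 0.65, 0.71),
    "after": (0.87, 0.81, 0.84),
}

recomputed = {
    label: f1_from_precision_recall(p, r) for label, (p, r, _) in reported.items()
}

for label, (p, r, f1) in reported.items():
    print(f"{label}: reported F1 = {f1:.2f}, recomputed = {recomputed[label]:.3f}")
```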

Conclusion

Our proposed FS method could improve the classification performance significantly. It could identify clinical factors related to prediction of AUD from EHRs, thereby effectively helping clinical staff to identify and treat AUD patients and improving medical knowledge of the AUD condition. Moreover, the diversity of features among female and male patients as well as gender disparity were investigated using FS methods and ML techniques.

Author and article information

Contributors

Ali Ebrahimi: aleb@mmmi.sdu.dk

Uffe Kock Wiil: ukwiil@mmmi.sdu.dk

Amin Naemi: amin@mmmi.sdu.dk

Marjan Mansourvar: marm@imada.sdu.dk

Kjeld Andersen: kjeld.andersen@rsyd.dk

Anette Søgaard Nielsen: ansnielsen@health.sdu.dk

Journal

Journal ID (nlm-ta): BMC Med Inform Decis Mak

Journal ID (iso-abbrev): BMC Med Inform Decis Mak

Title: BMC Medical Informatics and Decision Making

Publisher: BioMed Central (London)

ISSN (Electronic): 1472-6947

Publication date (Electronic): 23 November 2022

Publication date PMC-release: 23 November 2022

Publication date Collection: 2022

Volume: 22

Electronic Location Identifier: 304

Affiliations

[1] GRID grid.10825.3e, ISNI 0000 0001 0728 0170, SDU Health Informatics and Technology, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark

[2] GRID grid.10825.3e, ISNI 0000 0001 0728 0170, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark

[3] GRID grid.10825.3e, ISNI 0000 0001 0728 0170, Unit for Clinical Alcohol Research, Clinical Institute, University of Southern Denmark, Odense, Denmark

Article

Publisher ID: 2051

DOI: 10.1186/s12911-022-02051-w

PMC ID: 9686074

PubMed ID: 34983500

SO-VID: ed3219a8-1d35-4689-9485-b3010097086c

License:

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

History

Date received: 19 December 2021

Date accepted: 16 November 2022

Funding

Funded by: 5a DE-DK project Access & Acceleration

Custom metadata

ScienceOpen disciplines: Bioinformatics & Computational biology

Keywords: alcohol use disorder, clinical factor identification, gender disparity, machine learning, feature selection

