SMOTE for high-dimensional class-imbalanced data


Abstract

Background

Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally, undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve on random oversampling, but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.
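To make the two balancing strategies concrete, here is a minimal NumPy sketch of SMOTE-style interpolation alongside random undersampling. It is an illustration, not the authors' implementation: the function names `smote` and `random_undersample`, the default of five neighbours, and the use of Euclidean distance are assumptions made for this example.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolation (sketch)."""
    rng = np.random.default_rng(rng)
    n = X_min.shape[0]
    k = min(k, n - 1)  # cannot have more neighbours than other samples
    # Pairwise squared Euclidean distances among minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest neighbours
    base = rng.integers(0, n, size=n_new)  # randomly chosen minority samples
    neigh = nn[base, rng.integers(0, k, size=n_new)]  # one neighbour of each
    u = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    # Each synthetic sample lies on the segment between a sample and a neighbour.
    return X_min[base] + u * (X_min[neigh] - X_min[base])

def random_undersample(X_maj, n_keep, rng=None):
    """Keep n_keep majority samples chosen at random without replacement."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X_maj.shape[0], size=n_keep, replace=False)
    return X_maj[idx]
```

For example, with 10 minority and 30 majority samples, `smote(X_min, 20)` would add 20 synthetic minority samples, whereas `random_undersample(X_maj, 10)` would discard 20 majority samples; both yield class-balanced data.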

Results

While in most cases SMOTE seems beneficial with low-dimensional data, for most classifiers it does not attenuate the bias towards classification in the majority class when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers on high-dimensional data if the number of variables is first reduced by some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values, while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data.
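The statements about the mean and the variability can be illustrated with a short calculation for a single variable. This is a sketch under simplifying assumptions that are not spelled out in the abstract: the two original minority-class values are treated as independent with a common mean and variance, and the interpolation weight is uniform on the unit interval and independent of both. The symbols are chosen for this example.

```latex
% Sketch under simplifying assumptions (notation chosen for this example):
% X_i and X_l are two minority-class values of one variable, independent,
% with common mean \mu and variance \sigma^2; u ~ U(0,1) is independent of both.
\begin{align*}
X^{S} &= X_i + u\,(X_l - X_i) = (1-u)\,X_i + u\,X_l, \\
\mathrm{E}\!\left[X^{S}\right] &= \mathrm{E}[1-u]\,\mu + \mathrm{E}[u]\,\mu = \mu, \\
\mathrm{Var}\!\left[X^{S}\right] &= \sigma^2\,\mathrm{E}\!\left[(1-u)^2 + u^2\right]
  = \sigma^2\left(\tfrac{1}{3} + \tfrac{1}{3}\right) = \tfrac{2}{3}\,\sigma^2 .
\end{align*}
```

Under these assumptions the synthetic value keeps the class mean but has a smaller variance than the original values. A synthetic value is also correlated with the original values it was built from, so synthetic samples that share an original sample are not independent of one another.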

Conclusions

In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.
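The recommended order of operations (variable selection first, then SMOTE, then k-NN with the Euclidean distance) can be sketched as follows. This is a hypothetical illustration, not the authors' code: the helper names, the use of a Welch t-statistic for ranking variables, the default values of `m` and `k`, and the coding of the minority class as 1 are all assumptions made for this example.

```python
import numpy as np

def select_variables(X, y, m):
    """Indices of the m variables with the largest absolute Welch t-statistic."""
    a, b = X[y == 1], X[y == 0]
    # Assumes no variable is constant within both classes (se would be zero).
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-np.abs(t))[:m]

def smote(X_min, n_new, k=5, rng=None):
    """Interpolate between minority samples and their nearest neighbours."""
    rng = np.random.default_rng(rng)
    n = X_min.shape[0]
    k = min(k, n - 1)
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    nn = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    u = rng.random((n_new, 1))
    return X_min[base] + u * (X_min[neigh] - X_min[base])

def knn_predict(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples (Euclidean distance)."""
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]
    return (y_train[nn].mean(axis=1) > 0.5).astype(int)  # use odd k to avoid ties

def fit_predict(X_train, y_train, X_test, m=20, k=5, seed=0):
    # 1. Variable selection, using the training data only.
    keep = select_variables(X_train, y_train, m)
    Xs, Xt = X_train[:, keep], X_test[:, keep]
    # 2. SMOTE on the selected variables until the classes are balanced
    #    (assumes class 1 is the minority class).
    X_min, X_maj = Xs[y_train == 1], Xs[y_train == 0]
    n_new = len(X_maj) - len(X_min)
    synth = smote(X_min, n_new, k=k, rng=seed)
    X_bal = np.vstack([Xs, synth])
    y_bal = np.concatenate([y_train, np.ones(n_new, dtype=int)])
    # 3. k-NN classification of the test samples.
    return knn_predict(X_bal, y_bal, Xt, k=k)
```

Selecting variables before oversampling means that the neighbours used by both SMOTE and k-NN are computed in the reduced space, which is the setting in which the abstract reports a benefit.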


Author and article information

Journal

Journal ID (nlm-ta): BMC Bioinformatics

Journal ID (iso-abbrev): BMC Bioinformatics

Title: BMC Bioinformatics

Publisher: BioMed Central

ISSN (Electronic): 1471-2105

Publication date (Collection): 2013

Publication date (Electronic): 22 March 2013

Volume: 14

Page: 106

Affiliations

[1] Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia

Article

Publisher ID: 1471-2105-14-106

DOI: 10.1186/1471-2105-14-106

PMC ID: 3648438

PubMed ID: 23522326

SO-VID: 732c6452-92b5-4762-a7d7-9fe0e3a41efc

License:

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

History

Date received: 24 July 2012

Date accepted: 22 February 2013
