49
views
0
recommends
+1 Recommend
0 collections
    0
    shares
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      SMOTE for high-dimensional class-imbalanced data

      research-article
      1 , 1 ,
      BMC Bioinformatics
      BioMed Central

      Read this article at

      Bookmark
          There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

          Abstract

          Background

          Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data.

          Results

          While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards the classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values while it decreases the data variability and it introduces correlation between samples. We explain how our findings impact the class-prediction for high-dimensional data.

          Conclusions

          In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class.

          Related collections

          Most cited references22

          • Record: found
          • Abstract: found
          • Article: not found

          Diagnosis of multiple cancer types by shrunken centroids of gene expression.

          We have devised an approach to cancer class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier. We shrink the prototypes and hence obtain a classifier that is often more accurate than competing methods. Our method of "nearest shrunken centroids" identifies subsets of genes that best characterize each class. The technique is general and can be used in many other classification problems. To demonstrate its effectiveness, we show that the method was highly efficient in finding genes for classifying small round blue cell tumors and leukemias.
            Bookmark
            • Record: found
            • Abstract: not found
            • Article: not found

            Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties

              Bookmark
              • Record: found
              • Abstract: found
              • Article: not found

              Breast cancer classification and prognosis based on gene expression profiles from a population-based study.

              Comprehensive gene expression patterns generated from cDNA microarrays were correlated with detailed clinico-pathological characteristics and clinical outcome in an unselected group of 99 node-negative and node-positive breast cancer patients. Gene expression patterns were found to be strongly associated with estrogen receptor (ER) status and moderately associated with grade, but not associated with menopausal status, nodal status, or tumor size. Hierarchical cluster analysis segregated the tumors into two main groups based on their ER status, which correlated well with basal and luminal characteristics. Cox proportional hazards regression analysis identified 16 genes that were significantly associated with relapse-free survival at a stringent significance level of 0.001 to account for multiple comparisons. Of 231 genes previously reported by others [van't Veer, L. J., et al. (2002) Nature 415, 530-536] as being associated with survival, 93 probe elements overlapped with the set of 7,650 probe elements represented on the arrays used in this study. Hierarchical cluster analysis based on the set of 93 probe elements segregated our population into two distinct subgroups with different relapse-free survival (P < 0.03). The number of these 93 probe elements showing significant univariate association with relapse-free survival (P < 0.05) in the present study was 14, representing 11 unique genes. Genes involved in cell cycle, DNA replication, and chromosomal stability were consistently elevated in the various poor prognostic groups. In addition, glutathione S-transferase M3 emerged as an important survival marker in both studies. When taken together with other array studies, our results highlight the consistent biological and clinical associations with gene expression profiles.
                Bookmark

                Author and article information

                Journal
                BMC Bioinformatics
                BMC Bioinformatics
                BMC Bioinformatics
                BioMed Central
                1471-2105
                2013
                22 March 2013
                : 14
                : 106
                Affiliations
                [1 ]Institute for Biostatistics and Medical Informatics, University of Ljubljana, Vrazov trg 2, Ljubljana, Slovenia
                Article
                1471-2105-14-106
                10.1186/1471-2105-14-106
                3648438
                23522326
                732c6452-92b5-4762-a7d7-9fe0e3a41efc
                Copyright ©2013 Blagus and Lusa; licensee BioMed Central Ltd.

                This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

                History
                : 24 July 2012
                : 22 February 2013
                Categories
                Research Article

                Bioinformatics & Computational biology
                Bioinformatics & Computational biology

                Comments

                Comment on this article