      Using recursive feature elimination in random forest to account for correlated variables in high dimensional data

      research-article
      BMC Genetics
      BioMed Central
      Genetic Analysis Workshop 20 (GAW 20)
      4-8 March 2017
      Genomics, Genetics, Epigenomics, Methylation, Machine-learning, Omics, Integration, High-dimensional data, Random forest, Recursive feature elimination, Correlation

          Abstract

          Background

          Random forest (RF) is a machine-learning method that generally works well with high-dimensional problems and allows for nonlinear relationships between predictors; however, the presence of correlated predictors has been shown to impact its ability to identify strong predictors. The Random Forest-Recursive Feature Elimination algorithm (RF-RFE) mitigates this problem in smaller data sets, but this approach has not been tested in high-dimensional omics data sets.

          Results

          We integrated 202,919 genotypes and 153,422 methylation sites in 680 individuals, and compared the abilities of RF and RF-RFE to detect simulated causal associations, which included simulated genotype–methylation interactions, between these variables and triglyceride levels. Results show that RF was able to identify strong causal variables with a few highly correlated variables, but it did not detect other causal variables.

          Conclusions

          Although RF-RFE decreased the importance of correlated variables, in the presence of many correlated variables, it also decreased the importance of causal variables, making both hard to detect. These findings suggest that RF-RFE may not scale to high-dimensional data.
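The recursive elimination step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `drop_fraction` parameter, and the toy data are all hypothetical, and the importance scorer is a stand-in (absolute Pearson correlation) rather than a random forest importance, so the example stays dependency-free. Only the loop structure (score the surviving features, drop the lowest-ranked ones, repeat) is the point.

```python
import math
import random


def abs_pearson(xs, ys):
    """Stand-in importance score: |Pearson r|. NOT a random forest importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def score_features(columns, y, names):
    """Hypothetical scorer: returns {feature name: importance} for the given names."""
    return {name: abs_pearson(columns[name], y) for name in names}


def recursive_elimination(columns, y, importance_fn, drop_fraction=0.2, min_features=1):
    """Re-score the surviving features each round and drop the lowest-ranked fraction.

    Returns (survivors, eliminated), where `eliminated` lists features in the
    order they were removed (least important first).
    """
    remaining = list(columns)
    eliminated = []
    while len(remaining) > min_features:
        scores = importance_fn(columns, y, remaining)
        ranked = sorted(remaining, key=lambda name: scores[name])
        # Drop at least one feature per round, but never go below min_features.
        n_drop = max(1, int(len(remaining) * drop_fraction))
        n_drop = min(n_drop, len(remaining) - min_features)
        eliminated.extend(ranked[:n_drop])
        remaining = ranked[n_drop:]
    return remaining, eliminated


# Toy data (assumed for illustration): one causal feature and nine noise features.
rng = random.Random(0)
n_samples = 200
columns = {"causal": [rng.gauss(0, 1) for _ in range(n_samples)]}
for i in range(9):
    columns[f"noise_{i}"] = [rng.gauss(0, 1) for _ in range(n_samples)]
y = [2.0 * x + rng.gauss(0, 0.5) for x in columns["causal"]]

survivors, eliminated = recursive_elimination(columns, y, score_features)
print("survivors:", survivors)
print("eliminated (first dropped first):", eliminated)
```

With the stand-in scorer, each feature's score does not depend on which other features remain, so re-scoring every round changes nothing here. With a random forest, importances are recomputed on a refitted model after each removal, which is where the elimination schedule matters; a scorer backed by a forest could be passed as `importance_fn` in place of `score_features`.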

                Author and article information

                Contributors
                bdarst@wisc.edu
                kmalecki@wisc.edu
                cengelman@wisc.edu
                Conference
                Journal : BMC Genetics (BMC Genet)
                Publisher : BioMed Central (London)
                ISSN : 1471-2156
                Published : 17 September 2018
                Volume : 19
                Issue : Suppl 1
                Issue sponsor : Publication of the proceedings of Genetic Analysis Workshop 20 was supported by National Institutes of Health grant R01 GM031575. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they were not involved in the peer review process for any article on which they are an author. They declare no other competing interests.
                Article number : 65
                Affiliations
                ISNI 0000 0001 0701 8607, GRID grid.28803.31, Department of Population Health Sciences, School of Medicine and Public Health, University of Wisconsin; 610 Walnut Street, 1007 WARF, Madison, WI 53726 USA
                Article
                Publisher ID : 633
                DOI : 10.1186/s12863-018-0633-8
                PMC ID : 6157185
                PubMed ID : 30255764
                ScienceOpen record ID : 2c2b7740-4402-42d0-aa95-0b987c87b520
                © The Author(s). 2018

                Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Genetic Analysis Workshop 20
                GAW 20
                San Diego, CA, USA
                4-8 March 2017
                Categories
                Research