      Is Open Access

      Machine Learning and Integrative Analysis of Biomedical Big Data

      review-article


          Abstract

          Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

          Most cited references (172)


          SMOTE: Synthetic Minority Over-sampling Technique

          An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world datasets are predominantly composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
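
          The idea of creating synthetic minority class examples can be illustrated with a small sketch. This is a simplified illustration, not the implementation from the SMOTE paper: it assumes numeric feature vectors, Euclidean distance, and a randomly chosen base example for each synthetic point. The function name `smote_like_oversample` and its parameters (`n_synthetic`, `k`, `seed`) are hypothetical and chosen only for this example.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(
    minority: List[Sequence[float]],
    n_synthetic: int,
    k: int = 5,
    seed: int = 0,
) -> List[List[float]]:
    """Create synthetic minority examples by interpolating between a
    minority example and one of its k nearest minority neighbours.

    Simplified sketch: each base example is drawn at random, which is an
    assumption of this illustration rather than a detail from the source.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic: List[List[float]] = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

          Each synthetic point lies on the line segment between two existing minority examples, so the new examples stay inside the region the minority class already occupies. In the approach described above, this over-sampling would be combined with under-sampling of the majority class before training a classifier.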

            mixOmics: An R package for ‘omics feature selection and multiple data integration

            The advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.

              MapReduce


                Author and article information

                Journal: Genes (Basel)
                Publisher: MDPI
                ISSN: 2073-4425
                Published: 28 January 2019 (February 2019 issue)
                Volume: 10
                Issue: 2
                Article number: 87
                Affiliations
                [1] NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA; weiwang@cs.ucla.edu (W.W.); jw744@g.ucla.edu (J.W.); cjh9595@g.ucla.edu (H.C.); nchchung@gmail.com (N.C.C.)
                [2] Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA
                [3] Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA
                [4] Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA
                [5] Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA
                [6] Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland
                [7] Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA
                Author notes
                [*] Correspondence: bmirza@mednet.ucla.edu (B.M.); pping38@g.ucla.edu (P.P.); Tel.: +1-310-267-5624 (P.P.)
                Author information
                https://orcid.org/0000-0001-5080-2966
                https://orcid.org/0000-0001-6798-8867
                Article
                Publisher ID: genes-10-00087
                DOI: 10.3390/genes10020087
                PMCID: PMC6410075
                PMID: 30696086
                9395360e-7bfb-4920-980b-dc97f07d11a4
                © 2019 by the authors.

                Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

                History
                Received: 02 December 2018
                Accepted: 21 January 2019
                Categories
                Review

                Keywords: machine learning, multi-omics, data integration, curse of dimensionality, heterogeneous data, missing data, class imbalance, scalability
