Machine Learning and Integrative Analysis of Biomedical Big Data

Abstract

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those already associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

Most cited references 172

SMOTE: Synthetic Minority Over-sampling Technique

N. Chawla, K. W. Bowyer, L. Hall … (2002)

An approach to the construction of classifiers from imbalanced datasets is described. A dataset is imbalanced if the classification categories are not approximately equally represented. Often real-world data sets are predominately composed of "normal" examples with only a small percentage of "abnormal" or "interesting" examples. It is also the case that the cost of misclassifying an abnormal (interesting) example as a normal example is often much higher than the cost of the reverse error. Under-sampling of the majority (normal) class has been proposed as a good means of increasing the sensitivity of a classifier to the minority class. This paper shows that a combination of our method of over-sampling the minority (abnormal) class and under-sampling the majority (normal) class can achieve better classifier performance (in ROC space) than only under-sampling the majority class. This paper also shows that a combination of our method of over-sampling the minority class and under-sampling the majority class can achieve better classifier performance (in ROC space) than varying the loss ratios in Ripper or class priors in Naive Bayes. Our method of over-sampling the minority class involves creating synthetic minority class examples. Experiments are performed using C4.5, Ripper and a Naive Bayes classifier. The method is evaluated using the area under the Receiver Operating Characteristic curve (AUC) and the ROC convex hull strategy.
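
To make the idea of "creating synthetic minority class examples" concrete, here is a minimal sketch in Python. It assumes the commonly described form of SMOTE, in which a new point is placed at a random position on the line segment between a minority sample and one of its k nearest minority-class neighbours. The function name `smote_like_oversample`, its parameters, and the toy data are hypothetical and chosen for illustration; this is not the authors' reference implementation, and it omits the under-sampling of the majority class that the abstract combines it with.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points built from minority-class samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours
    (Euclidean distance). Illustrative sketch only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by distance to the chosen one.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position along the segment base -> neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority_points, n_new=6)
print(len(new_points))
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class rather than duplicating existing rows exactly.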

Is Open Access

mixOmics: An R package for ‘omics feature selection and multiple data integration

Florian Rohart, Benoît Gautier, Amrit Singh … (2017)

The advent of high throughput technologies has led to a wealth of publicly available ‘omics data coming from different sources, such as transcriptomics, proteomics, metabolomics. Combining such large-scale biological data sets can lead to the discovery of important biological insights, provided that relevant information can be extracted in a holistic manner. Current statistical approaches have been focusing on identifying small subsets of molecules (a ‘molecular signature’) to explain or predict biological conditions, but mainly for a single type of ‘omics. In addition, commonly used methods are univariate and consider each biological feature independently. We introduce mixOmics, an R package dedicated to the multivariate analysis of biological data sets with a specific focus on data exploration, dimension reduction and visualisation. By adopting a systems biology approach, the toolkit provides a wide range of methods that statistically integrate several data sets at once to probe relationships between heterogeneous ‘omics data sets. Our recent methods extend Projection to Latent Structure (PLS) models for discriminant analysis, for data integration across multiple ‘omics data or across independent studies, and for the identification of molecular signatures. We illustrate our latest mixOmics integrative frameworks for the multivariate analyses of ‘omics data available from the package.

MapReduce

Jeffrey S. Dean, Sanjay Ghemawat (2008)

Author and article information

Journal

Journal ID (nlm-ta): Genes (Basel)

Journal ID (iso-abbrev): Genes (Basel)

Journal ID (publisher-id): genes

Title: Genes

Publisher: MDPI

ISSN (Electronic): 2073-4425

Publication date (Electronic): 28 January 2019

Publication date Collection: February 2019

Volume: 10

Issue: 2

Electronic Location Identifier: 87

Affiliations

[1] NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA; weiwang@cs.ucla.edu (W.W.); jw744@g.ucla.edu (J.W.); cjh9595@g.ucla.edu (H.C.); nchchung@gmail.com (N.C.C.)

[2] Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA

[3] Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA

[4] Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA

[5] Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA

[6] Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland

[7] Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA

Author notes

[*] Correspondence: bmirza@mednet.ucla.edu (B.M.); pping38@g.ucla.edu (P.P.); Tel.: +1-310-267-5624 (P.P.)

Author information

Howard Choi https://orcid.org/0000-0001-5080-2966

Neo Christopher Chung https://orcid.org/0000-0001-6798-8867

Article

Publisher ID: genes-10-00087

DOI: 10.3390/genes10020087

PMC ID: 6410075

PubMed ID: 30696086

SO-VID: 9395360e-7bfb-4920-980b-dc97f07d11a4

License:

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
