Abstract
Large, high-dimensional datasets containing different types of variables are becoming increasingly common, and exploring such data calls for integrated methods. For example, a single genomic experiment can produce large quantities of different types of data (including clinical data), making it a challenge to coherently describe the patterns of variability within and between the inter-related datasets. Mutual information (MI) is a widely used information-theoretic dependency measure that can also identify nonlinear and nonmonotonic associations. First, we develop a computationally efficient implementation of MI between a discrete and a continuous variable. This implementation allows us to apply a coherent approach to all comparisons arising from continuous and categorical data. As commonly applied, however, MI estimates can have high levels of bias, so we present a novel development of MI that reduces this bias, which we term bias-corrected mutual information (BCMI). BCMI is also useful as an association measure that can be incorporated into subsequent analyses such as clustering and visualisation procedures. To demonstrate our approach, we re-examine a genomic dataset containing single nucleotide polymorphisms (SNPs, a discrete variable) together with gene expression levels and clinical data (all continuous variables). Our approach allows us to integrate these different types of data by exploring associations both within and between these types of variables.
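As a rough illustration of the kind of discrete-continuous MI estimation the abstract describes, the sketch below applies scikit-learn's nearest-neighbour estimator (`mutual_info_classif`) to simulated genotype/expression data. This is a standard off-the-shelf estimator, not the paper's efficient implementation or its BCMI bias correction, and the data and variable names are hypothetical.

```python
# Minimal sketch (not the authors' BCMI method): estimating mutual
# information between a discrete variable (a SNP genotype) and a
# continuous one (a gene expression level).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)

# Hypothetical data: genotypes coded 0/1/2, and an expression level that
# depends nonmonotonically on genotype, so Pearson correlation would
# largely miss the association while MI can detect it.
genotype = rng.integers(0, 3, size=500)
expression = (genotype - 1) ** 2 + rng.normal(scale=0.5, size=500)

# mutual_info_classif takes the continuous feature as a 2-D array and
# the discrete variable as the target; it returns MI estimates in nats.
mi = mutual_info_classif(expression.reshape(-1, 1), genotype, random_state=0)
print(f"Estimated MI (nats): {mi[0]:.3f}")
```

Note that nearest-neighbour MI estimates of this kind can still carry finite-sample bias; reducing that bias is precisely the motivation for the BCMI measure the paper develops.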
| Original language | English |
|---|---|
| Pages (from-to) | 178-199 |
| Number of pages | 22 |
| Journal | Annals of Applied Statistics |
| Volume | 12 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Mar 2018 |