Selection bias in plots of microarray or other data that have been sampled from a high-dimensional space

John H. Maindonald*, C. J. Burden

*Corresponding author for this work

    Research output: Contribution to journal › Article › peer-review

    2 Citations (Scopus)

    Abstract

    For data that have many more features than observations, finding a low-dimensional representation that accurately reflects known prior groupings is non-trivial. Microarray gene expression data, used to create a "signature" or discrimination rule that distinguishes cancer tissues classified according to type of cancer, are an important special case. The optimal number of features is suitably determined using cross-validation, in which each of several parts of the data becomes in turn the test set, with the remaining data used for training. At each such division or "fold" of the data into a training and test set, both the selection of features and the derivation of the discriminant rule must be repeated. Use of the complete data for prior selection of features can lead to a grossly optimistic assessment of predictive accuracy and, in scatter plots that show discriminant function scores, to a spurious or exaggerated separation between groups. At each division or fold, a second versus first discriminant axis plot of test scores can be drawn. This paper presents a method for bringing these different plots, which have different choices of features and relate to different coordinate systems, into a single plot in which the configuration of points fairly reflects the accuracy of the discriminant procedure. The methodology is applicable, in principle, to use of any discriminant analysis methodology, or of ordination or multidimensional scaling, for obtaining a low-dimensional graphical representation of data.
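
    The selection effect described in the abstract can be illustrated with a minimal sketch on pure-noise data, where no real group difference exists. Everything here is an assumption made for illustration and is not taken from the paper: the ranking of features by difference in class means, the nearest-centroid rule, the function names, and the sample sizes are all hypothetical stand-ins for whatever selection and discriminant method is actually used.

```python
import random

def top_features(X, y, k):
    """Indices of the k features with the largest absolute difference in class means."""
    def score(j):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        return abs(sum(a) / len(a) - sum(b) / len(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict(X_train, y_train, x, feats):
    """Nearest-centroid rule restricted to feats (a stand-in discriminant rule)."""
    best, best_dist = None, float("inf")
    for lab in (0, 1):
        rows = [r for r, l in zip(X_train, y_train) if l == lab]
        dist = sum((x[j] - sum(r[j] for r in rows) / len(rows)) ** 2 for j in feats)
        if dist < best_dist:
            best, best_dist = lab, dist
    return best

def cv_accuracy(X, y, k, n_folds, select_inside_fold):
    """Cross-validated accuracy, selecting features per fold or once on all data."""
    n = len(X)
    folds = [list(range(i, n, n_folds)) for i in range(n_folds)]
    global_feats = top_features(X, y, k)  # uses the complete data: the biased choice
    correct = 0
    for test_idx in folds:
        train_idx = [i for i in range(n) if i not in test_idx]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # Honest version: repeat the feature selection on the training part only.
        feats = top_features(X_tr, y_tr, k) if select_inside_fold else global_feats
        for i in test_idx:
            correct += predict(X_tr, y_tr, X[i], feats) == y[i]
    return correct / n

# Assumed toy setting: 20 observations, 500 noise features, 2 arbitrary groups.
rng = random.Random(0)
n_obs, n_feat = 20, 500
X = [[rng.gauss(0.0, 1.0) for _ in range(n_feat)] for _ in range(n_obs)]
y = [i % 2 for i in range(n_obs)]

biased = cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=False)
honest = cv_accuracy(X, y, k=10, n_folds=5, select_inside_fold=True)
print(f"features selected on complete data: {biased:.2f}")
print(f"features selected within each fold: {honest:.2f}")
```

    Because the data are noise, an honest estimate should tend to sit near chance (0.5), while selecting features on the complete data will typically report a higher accuracy. The exact figures depend on the seed and on the assumed sizes.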

    Original language: English
    Pages (from-to): C59-C74
    Journal: ANZIAM Journal
    Volume: 46
    Issue number: 5 ELECTRONIC SUPPL.
    Publication status: Published - 2004
