TY - JOUR
T1 - Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes
AU - Hall, Peter
AU - Pittelkow, Yvonne
AU - Ghosh, Malay
PY - 2008/2
Y1 - 2008/2
N2 - We suggest a technique, related to the concept of 'detection boundary' that was developed by Ingster and by Donoho and Jin, for comparing the theoretical performance of classifiers constructed from small training samples of very large vectors. The resulting 'classification boundaries' are obtained for a variety of distance-based methods, including the support vector machine, distance-weighted discrimination and kth-nearest-neighbour classifiers, for thresholded forms of those methods, and for techniques based on Donoho and Jin's higher criticism approach to signal detection. Assessed in these terms, standard distance-based methods are shown to be capable only of detecting differences between populations when those differences can be estimated consistently. However, the thresholded forms of distance-based classifiers can do better, and in particular can correctly classify data even when differences between distributions are only detectable, not estimable. Other methods, including higher criticism classifiers, can on occasion perform better still, but they tend to be more limited in scope, requiring substantially more information about the marginal distributions. Moreover, as tail weight becomes heavier the classification boundaries of methods designed for particular distribution types can converge to, and achieve, the boundary for thresholded nearest neighbour approaches. For example, although higher criticism has a lower classification boundary, and in this sense performs better, in the case of normal data, the boundaries are identical for exponentially distributed data when both sample sizes equal 1.
KW - Classification boundary
KW - Detection
KW - Distance-based classification
KW - Distance-weighted discrimination
KW - Higher criticism
KW - Nearest neighbour method
KW - Sparsity
KW - Support vector machine
KW - Thresholding
KW - Truncation
UR - http://www.scopus.com/inward/record.url?scp=37849031233&partnerID=8YFLogxK
U2 - 10.1111/j.1467-9868.2007.00631.x
DO - 10.1111/j.1467-9868.2007.00631.x
M3 - Article
SN - 1369-7412
VL - 70
SP - 159
EP - 173
JO - Journal of the Royal Statistical Society. Series B: Statistical Methodology
JF - Journal of the Royal Statistical Society. Series B: Statistical Methodology
IS - 1
ER -