TY - JOUR

T1 - Cross-validation and the estimation of conditional probability densities

AU - Hall, Peter

AU - Racine, Jeff

AU - Li, Qi

PY - 2004/12

Y1 - 2004/12

N2 - Many practical problems, especially some connected with forecasting, require nonparametric estimation of conditional densities from mixed data. For example, given an explanatory data vector X for a prospective customer, with components that could include the customer's salary, occupation, age, sex, marital status, and address, a company might wish to estimate the density of the expenditure, Y, that could be made by that person, basing the inference on observations of (X,Y) for previous clients. Choosing appropriate smoothing parameters for this problem can be tricky, not least because plug-in rules take a particularly complex form in the case of mixed data. An obvious difficulty is that there exists no general formula for the optimal smoothing parameters. More insidiously, and more seriously, it can be difficult to determine which components of X are relevant to the problem of conditional inference. For example, if the jth component of X is independent of Y, then that component is irrelevant to estimating the density of Y given X, and ideally should be dropped before conducting inference. In this article we show that cross-validation overcomes these difficulties. It automatically determines which components are relevant and which are not, through assigning large smoothing parameters to the latter and consequently shrinking them toward the uniform distribution on the respective marginals. This effectively removes irrelevant components from contention, by suppressing their contribution to estimator variance; they already have very small bias, a consequence of their independence of Y. Cross-validation also yields important information about which components are relevant: the relevant components are precisely those that cross-validation has chosen to smooth in a traditional way, by assigning them smoothing parameters of conventional size. Indeed, cross-validation produces asymptotically optimal smoothing for relevant components, while eliminating irrelevant components by oversmoothing. In the problem of nonparametric estimation of a conditional density, cross-validation comes into its own as a method with no obvious peers.

AB - Many practical problems, especially some connected with forecasting, require nonparametric estimation of conditional densities from mixed data. For example, given an explanatory data vector X for a prospective customer, with components that could include the customer's salary, occupation, age, sex, marital status, and address, a company might wish to estimate the density of the expenditure, Y, that could be made by that person, basing the inference on observations of (X,Y) for previous clients. Choosing appropriate smoothing parameters for this problem can be tricky, not least because plug-in rules take a particularly complex form in the case of mixed data. An obvious difficulty is that there exists no general formula for the optimal smoothing parameters. More insidiously, and more seriously, it can be difficult to determine which components of X are relevant to the problem of conditional inference. For example, if the jth component of X is independent of Y, then that component is irrelevant to estimating the density of Y given X, and ideally should be dropped before conducting inference. In this article we show that cross-validation overcomes these difficulties. It automatically determines which components are relevant and which are not, through assigning large smoothing parameters to the latter and consequently shrinking them toward the uniform distribution on the respective marginals. This effectively removes irrelevant components from contention, by suppressing their contribution to estimator variance; they already have very small bias, a consequence of their independence of Y. Cross-validation also yields important information about which components are relevant: the relevant components are precisely those that cross-validation has chosen to smooth in a traditional way, by assigning them smoothing parameters of conventional size. Indeed, cross-validation produces asymptotically optimal smoothing for relevant components, while eliminating irrelevant components by oversmoothing. In the problem of nonparametric estimation of a conditional density, cross-validation comes into its own as a method with no obvious peers.

KW - Bandwidth choice

KW - Binary data

KW - Categorical data

KW - Continuous data

KW - Dimension reduction

KW - Discrete data

KW - Kernel methods

KW - Mixed data

KW - Nonparametric density estimation

KW - Relevant and irrelevant data

KW - Smoothing parameter choice

UR - http://www.scopus.com/inward/record.url?scp=10144254861&partnerID=8YFLogxK

U2 - 10.1198/016214504000000548

DO - 10.1198/016214504000000548

M3 - Article

SN - 0162-1459

VL - 99

SP - 1015

EP - 1026

JO - Journal of the American Statistical Association

JF - Journal of the American Statistical Association

IS - 468

ER -