TY - JOUR
T1 - Optimality of universal Bayesian sequence prediction for general loss and alphabet
AU - Hutter, Marcus
PY - 2004/8/15
Y1 - 2004/8/15
N2 - Various optimality properties of universal sequence predictors based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will be studied. The probability of observing x_t at time t, given past observations x_1...x_{t-1}, can be computed with the chain rule if the true generating distribution μ of the sequences x_1x_2x_3... is known. If μ is unknown, but known to belong to a countable or continuous class M, one can base one's prediction on the Bayes-mixture ξ defined as a w_ν-weighted sum or integral of distributions ν ∈ M. The cumulative expected loss of the Bayes-optimal universal prediction scheme based on ξ is shown to be close to the loss of the Bayes-optimal, but infeasible, prediction scheme based on μ. We show that the bounds are tight and that no other predictor can lead to significantly smaller bounds. Furthermore, for various performance measures, we show Pareto-optimality of ξ and give an Occam's razor argument that the choice w_ν ∼ 2^{-K(ν)} for the weights is optimal, where K(ν) is the length of the shortest program describing ν. The results are applied to games of chance, defined as a sequence of bets, observations, and rewards. The prediction schemes (and bounds) are compared to the popular predictors based on expert advice. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.
AB - Various optimality properties of universal sequence predictors based on Bayes-mixtures in general, and Solomonoff's prediction scheme in particular, will be studied. The probability of observing x_t at time t, given past observations x_1...x_{t-1}, can be computed with the chain rule if the true generating distribution μ of the sequences x_1x_2x_3... is known. If μ is unknown, but known to belong to a countable or continuous class M, one can base one's prediction on the Bayes-mixture ξ defined as a w_ν-weighted sum or integral of distributions ν ∈ M. The cumulative expected loss of the Bayes-optimal universal prediction scheme based on ξ is shown to be close to the loss of the Bayes-optimal, but infeasible, prediction scheme based on μ. We show that the bounds are tight and that no other predictor can lead to significantly smaller bounds. Furthermore, for various performance measures, we show Pareto-optimality of ξ and give an Occam's razor argument that the choice w_ν ∼ 2^{-K(ν)} for the weights is optimal, where K(ν) is the length of the shortest program describing ν. The results are applied to games of chance, defined as a sequence of bets, observations, and rewards. The prediction schemes (and bounds) are compared to the popular predictors based on expert advice. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.
KW - Bayesian sequence prediction
KW - Classification
KW - Games of chance
KW - Kolmogorov complexity
KW - Learning
KW - Mixture distributions
KW - Pareto-optimality
KW - Solomonoff induction
KW - Tight loss and error bounds
KW - Universal probability
UR - http://www.scopus.com/inward/record.url?scp=4644374039&partnerID=8YFLogxK
U2 - 10.1162/1532443041827952
DO - 10.1162/1532443041827952
M3 - Review article
SN - 1532-4435
VL - 4
SP - 971
EP - 1000
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
IS - 6
ER -