Abstract
We introduce an on-line algorithm for finding local maxima of the average reward in a Partially Observable Markov Decision Process (POMDP) controlled by a parameterized policy. Optimization is over the parameters of the policy. The algorithm's chief advantages are that it requires only a single sample path of the POMDP, it uses only one free parameter β ∈ [0,1], which has a natural interpretation in terms of a bias-variance trade-off, and it requires no knowledge of the underlying state. In addition, the algorithm can be applied to infinite state, control and observation spaces. We prove almost-sure convergence of our algorithm, and show how the correct setting of β is related to the mixing time of the Markov chain induced by the POMDP.
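The abstract describes an online, trace-based gradient estimator whose single parameter β ∈ [0,1] discounts an eligibility trace, trading bias against variance. Below is a minimal sketch of an estimator of that general kind, not the authors' exact algorithm: the class name, the softmax policy, and the toy reward signal are all illustrative assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over action preferences."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()


class OnlineGradientEstimator:
    """Sketch of an eligibility-trace policy-gradient estimator (illustrative).

    beta in [0, 1] discounts the trace: values near 1 reduce bias but
    increase variance, matching the trade-off noted in the abstract.
    """

    def __init__(self, num_params, beta):
        self.beta = beta
        self.z = np.zeros(num_params)     # eligibility trace
        self.grad = np.zeros(num_params)  # running gradient estimate
        self.t = 0

    def step(self, score, reward):
        """Update from one step of a single sample path.

        score  -- gradient of log pi(action | observation, theta) at the taken action
        reward -- reward received after taking that action
        """
        self.z = self.beta * self.z + score
        self.t += 1
        # Running average of the reward-weighted trace.
        self.grad += (reward * self.z - self.grad) / self.t
        return self.grad


# Illustrative use with a hypothetical two-action softmax policy on scalar observations.
rng = np.random.default_rng(0)
theta = np.zeros(2)
est = OnlineGradientEstimator(num_params=2, beta=0.9)
for _ in range(1000):
    obs = rng.normal()
    probs = softmax(theta * obs)
    action = rng.choice(2, p=probs)
    score = obs * (np.eye(2)[action] - probs)   # grad of log softmax policy
    reward = float(action == (obs > 0))         # toy reward, for illustration only
    grad_estimate = est.step(score, reward)
```

Note that the estimator touches only the observation, the chosen action, and the reward; it needs no access to the underlying state, in keeping with the setting described above.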
| Original language | English |
|---|---|
| Pages (from-to) | 124-129 |
| Number of pages | 6 |
| Journal | Proceedings of the IEEE Conference on Decision and Control |
| Volume | 1 |
| Publication status | Published - 2000 |
| Event | 39th IEEE Conference on Decision and Control, Sydney, NSW, Australia, 12 Dec 2000 → 15 Dec 2000 |