Bayesian real-time dynamic programming

Scott Sanner*, Robby Goetschalckx, Kurt Driessens, Guy Shani

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

31 Citations (Scopus)

Abstract

Real-time dynamic programming (RTDP) solves Markov decision processes (MDPs) when the initial state is restricted, by focusing dynamic programming on the envelope of states reachable from an initial state set. RTDP often provides performance guarantees without visiting the entire state space. Building on RTDP, recent work has sought to improve its efficiency through various optimizations, including maintaining upper and lower bounds to both govern trial termination and prioritize state exploration. In this work, we take a Bayesian perspective on these upper and lower bounds and use a value of perfect information (VPI) analysis to govern trial termination and exploration in a novel algorithm we call VPI-RTDP. VPI-RTDP leads to an improvement over state-of-the-art RTDP methods, empirically yielding up to a three-fold reduction in the amount of time and number of visited states required to achieve comparable policy performance.

Original languageEnglish
Title of host publicationIJCAI-09 - Proceedings of the 21st International Joint Conference on Artificial Intelligence
PublisherInternational Joint Conferences on Artificial Intelligence
Pages1784-1789
Number of pages6
ISBN (Print)9781577354260
Publication statusPublished - 2009
Externally publishedYes
Event21st International Joint Conference on Artificial Intelligence, IJCAI 2009 - Pasadena, United States
Duration: 11 Jul 200916 Jul 2009

Publication series

NameIJCAI International Joint Conference on Artificial Intelligence
ISSN (Print)1045-0823

Conference

Conference21st International Joint Conference on Artificial Intelligence, IJCAI 2009
Country/TerritoryUnited States
CityPasadena
Period11/07/0916/07/09

Fingerprint

Dive into the research topics of 'Bayesian real-time dynamic programming'. Together they form a unique fingerprint.

Cite this