TY - GEN
T1 - Context tree maximizing reinforcement learning
AU - Nguyen, Phuong
AU - Sunehag, Peter
AU - Hutter, Marcus
PY - 2012
Y1 - 2012
AB - Recent developments in reinforcement learning for non-Markovian problems have witnessed a surge in history-based methods, among which we are particularly interested in two frameworks: ΦMDP and MC-AIXI-CTW. ΦMDP attempts to reduce the general RL problem, where the environment's states and dynamics are both unknown, to an MDP, while MC-AIXI-CTW incrementally learns a mixture of context trees as its environment model. The main idea of ΦMDP is to connect generic reinforcement learning with classical reinforcement learning. The first implementation of ΦMDP relies on a stochastic search procedure to find a tree that minimizes a certain cost function. This does not guarantee finding the minimizing tree, or even a good one, within limited search time; as a consequence, the approach appears to have difficulties with large domains. MC-AIXI-CTW is attractive in that it can incrementally and analytically compute its internal model through interactions with the environment. Unfortunately, it is computationally demanding because it requires heavy planning simulations at every single time step. We devise a novel approach called CTMRL, which analytically and efficiently finds the cost-minimizing tree. Instead of the context-tree weighting method on which MC-AIXI-CTW is based, we use the closely related context-tree maximizing algorithm, which selects a single tree. This approach falls under the ΦMDP framework, which allows the costly planning component of MC-AIXI-CTW to be replaced with simple Q-Learning. Our empirical investigation shows that CTMRL finds policies of the same quality as MC-AIXI-CTW's on six domains, including a challenging Pacman domain, but in an order of magnitude less time.
UR - http://www.scopus.com/inward/record.url?scp=84868292790&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781577355687
T3 - Proceedings of the National Conference on Artificial Intelligence
SP - 1075
EP - 1082
BT - AAAI-12 / IAAI-12 - Proceedings of the 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference
T2 - 26th AAAI Conference on Artificial Intelligence and the 24th Innovative Applications of Artificial Intelligence Conference, AAAI-12 / IAAI-12
Y2 - 22 July 2012 through 26 July 2012
ER -