On Q-learning convergence for non-Markov decision processes

Sultan Javed Majeed, Marcus Hutter

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    23 Citations (Scopus)

    Abstract

    Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques; it enables an agent to learn the optimal action-value function, i.e. the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains that may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.
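
    For readers unfamiliar with the update the abstract refers to, the following is a minimal, generic sketch of tabular Q-learning with the standard TD update. It is not code from the paper; the environment interface (reset/step/actions), the hyperparameters, and the epsilon-greedy exploration scheme are illustrative assumptions.

        from collections import defaultdict
        import random

        def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
            """Tabular Q-learning with epsilon-greedy exploration.

            env is assumed (hypothetically) to expose reset() -> state,
            step(action) -> (next_state, reward, done), and a list env.actions.
            """
            Q = defaultdict(float)  # Q[(state, action)] -> estimated action value

            def greedy(state):
                return max(env.actions, key=lambda a: Q[(state, a)])

            for _ in range(episodes):
                state, done = env.reset(), False
                while not done:
                    # Epsilon-greedy action selection.
                    if random.random() < epsilon:
                        action = random.choice(env.actions)
                    else:
                        action = greedy(state)
                    next_state, reward, done = env.step(action)
                    # TD target: reward plus discounted max over next actions (off-policy update).
                    best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
                    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
                    state = next_state
            return Q

    The sketch is the usual MDP-style formulation; the paper's contribution concerns when this same update still converges in non-MDP and non-ergodic settings, namely when the optimal Q-value function is state-uniform.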

    Original language: English
    Title of host publication: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
    Editors: Jerome Lang
    Publisher: International Joint Conferences on Artificial Intelligence
    Pages: 2546-2552
    Number of pages: 7
    ISBN (Electronic): 9780999241127
    Publication status: Published - 2018
    Event: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018 - Stockholm, Sweden
    Duration: 13 Jul 2018 - 19 Jul 2018

    Publication series

    Name: IJCAI International Joint Conference on Artificial Intelligence
    Volume: 2018-July
    ISSN (Print): 1045-0823

    Conference

    Conference: 27th International Joint Conference on Artificial Intelligence, IJCAI 2018
    Country/Territory: Sweden
    City: Stockholm
    Period: 13/07/18 - 19/07/18
