December 3, 2018
Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes
Neural Information Processing Systems (NeurIPS)
When designing the state space of a Markov decision process (MDP), it is common to include states that are transient or unreachable under any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite MDP without requiring any form of prior knowledge.
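The mountain car example above can be made concrete with a small sketch (not from the paper): discretize the product space of position and velocity into a grid, roll out random policies under the classic mountain car dynamics, and count how many grid cells are ever visited. The grid size, rollout counts, and start region below are assumptions chosen for illustration.

```python
import math
import random

# Classic mountain car state bounds.
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
BINS = 20  # grid resolution per dimension (assumed for illustration)

def step(pos, vel, action):
    """One step of the standard mountain car dynamics; action in {-1, 0, 1}."""
    vel = min(max(vel + 0.001 * action - 0.0025 * math.cos(3 * pos), VEL_MIN), VEL_MAX)
    pos = min(max(pos + vel, POS_MIN), POS_MAX)
    if pos == POS_MIN:  # inelastic collision with the left wall
        vel = 0.0
    return pos, vel

def cell(pos, vel):
    """Map a continuous (position, velocity) pair to a grid cell index."""
    i = min(int((pos - POS_MIN) / (POS_MAX - POS_MIN) * BINS), BINS - 1)
    j = min(int((vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * BINS), BINS - 1)
    return i, j

random.seed(0)
visited = set()
for _ in range(200):  # random-policy rollouts from the standard start region
    pos, vel = random.uniform(-0.6, -0.4), 0.0
    for _ in range(500):
        visited.add(cell(pos, vel))
        pos, vel = step(pos, vel, random.choice((-1, 0, 1)))

total = BINS * BINS
print(f"visited {len(visited)} of {total} grid cells")
```

Because some (position, velocity) combinations are never produced by the dynamics, the visited count stays well below the full product grid, which is exactly how such a discretization yields a weakly-communicating MDP.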
By: Ronan Fruit, Matteo Pirotta, Alessandro Lazaric
Facebook AI Research