Near Optimal Exploration-Exploitation in Non-Communicating Markov Decision Processes

Neural Information Processing Systems (NeurIPS)


While designing the state space of an MDP, it is common to include states that are transient or not reachable by any policy (e.g., in mountain car, the product space of speed and position contains configurations that are not physically reachable). This results in weakly-communicating or multi-chain MDPs. In this paper, we introduce TUCRL, the first algorithm able to perform efficient exploration-exploitation in any finite Markov Decision Process (MDP) without requiring any form of prior knowledge. In particular, for any MDP with SC communicating states, A actions and ΓCSC possible communicating next states, we derive a Õ(DC √ ΓCSCAT) regret bound, where DC is the diameter (i.e., the length of the longest shortest path between any two states) of the communicating part of the MDP. This is in contrast with existing optimistic algorithms (e.g., UCRL, Optimistic PSRL) that suffer linear regret in weakly-communicating MDPs, as well as posterior sampling or regularised algorithms (e.g., REGAL), which require prior knowledge on the bias span of the optimal policy to achieve sub-linear regret. We also prove that in weakly-communicating MDPs, no algorithm can ever achieve a logarithmic growth of the regret without first suffering a linear regret for a number of steps that is exponential in the parameters of the MDP. Finally, we report numerical simulations supporting our theoretical findings and showing how TUCRL overcomes the limitations of the state-of-the-art.

Related Publications

All Publications

NeurIPS - December 6, 2021

Parallel Bayesian Optimization of Multiple Noisy Objectives with Expected Hypervolume Improvement

Samuel Daulton, Maximilian Balandat, Eytan Bakshy

arXiv - January 29, 2020

fastMRI: An Open Dataset and Benchmarks for Accelerated MRI

Jure Zbontar, Florian Knoll, Anuroop Sriram, Tullie Murrell, Zhengnan Huang, Matthew J. Muckley, Aaron Defazio, Ruben Stern, Patricia Johnson, Mary Bruno, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, Zizhao Zhang, Michal Drozdzal, Adriana Romero, Michael Rabbat, Pascal Vincent, Nafissa Yakubova, James Pinkerton, Duo Wang, Erich Owens, Larry Zitnick, Michael P. Recht, Daniel K. Sodickson, Yvonne W. Lui

arXiv - April 20, 2021

MBRL-Lib: A Modular Library for Model-based Reinforcement Learning

Luis Pineda, Brandon Amos, Amy Zhang, Nathan O. Lambert, Roberto Calandra

ICCC - September 14, 2021

Visual Conceptual Blending with Large-scale Language and Vision Models

Songwei Ge, Devi Parikh

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookie Policy