November 27, 2019
Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies
Neural Information Processing Systems (NeurIPS)
State-of-the-art efficient model-based Reinforcement Learning (RL) algorithms typically act by iteratively solving empirical models, i.e., by performing full planning on Markov Decision Processes (MDPs) built from the gathered experience. In this paper, we focus on model-based RL in the finite-state, finite-horizon, undiscounted MDP setting and establish that exploring with greedy policies (i.e., acting by 1-step planning) can achieve tight minimax performance in terms of regret, Õ(√HSAT).
By: Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, Shie Mannor
Facebook AI Research
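To make the contrast concrete, the sketch below illustrates the idea of acting by 1-step planning: instead of fully re-solving the empirical MDP each episode, the agent takes a single greedy Bellman backup against its current optimistic value table and updates only the visited state. This is a minimal illustrative sketch, not the paper's algorithm; the toy model (`P_hat`, `R_hat`), the optimistic initialization, and all variable names are assumptions made for a self-contained demo.

```python
import numpy as np

# Hypothetical toy setup: S states, A actions, horizon H. In practice the
# "empirical" model (P_hat, R_hat) would be built from gathered experience;
# here it is randomly fixed so the example runs on its own.
rng = np.random.default_rng(0)
S, A, H = 4, 2, 5
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # P_hat[s, a]: distribution over next states
R_hat = rng.uniform(size=(S, A))                # empirical mean rewards in [0, 1]

# Optimistic value table, one row per step; V[H] = 0 by convention.
# With rewards in [0, 1], true values are at most H, so H is optimistic.
V = np.zeros((H + 1, S))
V[:H] = H

def one_step_greedy(s, h):
    """Act by 1-step planning: choose the action maximizing the empirical
    Bellman backup against the current value table, then update only V[h][s]
    (no full planning sweep over the whole MDP)."""
    q = R_hat[s] + P_hat[s] @ V[h + 1]  # 1-step lookahead, shape (A,)
    a = int(np.argmax(q))
    V[h][s] = min(V[h][s], q[a])        # tighten the estimate while keeping optimism
    return a

# Roll out one episode from state 0 under the 1-step greedy policy.
s = 0
for h in range(H):
    a = one_step_greedy(s, h)
    s = int(rng.choice(S, p=P_hat[s, a]))
```

Each episode thus costs only O(A·S) computation per visited state, versus a full value-iteration sweep over all states for full planning.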