Perturbed-History Exploration in Stochastic Multi-Armed Bandits

International Joint Conference on Artificial Intelligence (IJCAI)

By: Branislav Kveton, Csaba Szepesvári, Mohammad Ghavamzadeh, Craig Boutilier


We propose an online algorithm for cumulative regret minimization in a stochastic multi-armed bandit. The algorithm adds O(t) i.i.d. pseudo-rewards to its history in round t and then pulls the arm with the highest average reward in its perturbed history. Therefore, we call it perturbed-history exploration (PHE). The pseudo-rewards are designed to offset the underestimated mean rewards of arms in round t with a high probability. We analyze PHE in a K-armed bandit and derive both O(K Δ⁻¹ log n) and O(√(Kn log n)) bounds on its n-round regret, where Δ denotes the minimum gap between the mean rewards of the optimal and suboptimal arms. The key to our analysis is a novel argument that shows that randomized Bernoulli rewards lead to optimism. We empirically compare PHE to several baselines and show that it is competitive with the best of them.
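The idea in the abstract can be sketched in a short simulation. The sketch below is an illustrative reading, not the paper's exact specification: it assumes the pseudo-rewards are i.i.d. Bernoulli(1/2), that arm i with s observed rewards receives ⌈a·s⌉ pseudo-rewards when its perturbed mean is computed, and that the scale a = 1.1 is a reasonable illustrative choice; the true means, round count, and seed are hypothetical simulation parameters.

```python
import math
import random

def phe(means, n, a=1.1, seed=0):
    """Sketch of perturbed-history exploration (PHE) on a Bernoulli bandit.

    means : true mean rewards of the K arms (simulation only)
    n     : number of rounds
    a     : perturbation scale; arm with s pulls gets ceil(a * s)
            Bernoulli(1/2) pseudo-rewards (assumed design)
    Returns the number of pulls of each arm after n rounds.
    """
    rng = random.Random(seed)
    K = len(means)
    history = [[] for _ in range(K)]  # observed rewards per arm

    def perturbed_mean(i):
        s = len(history[i])
        m = math.ceil(a * s)
        # i.i.d. Bernoulli(1/2) pseudo-rewards mixed into the history
        pseudo = sum(rng.random() < 0.5 for _ in range(m))
        return (sum(history[i]) + pseudo) / (s + m)

    for t in range(n):
        if t < K:
            i = t  # initialization: pull each arm once
        else:
            # pull the arm with the highest average perturbed reward
            i = max(range(K), key=perturbed_mean)
        reward = 1.0 if rng.random() < means[i] else 0.0
        history[i].append(reward)

    return [len(h) for h in history]
```

With a large gap between arms, the perturbation is enough to keep exploring early on while the empirical averages eventually dominate, so the optimal arm accumulates most of the pulls.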