## What the research is:

Many businesses, including Facebook, send SMS messages to their users for phone number verification, two-factor authentication, and notifications. In order to deliver these SMS messages to their users, companies generally leverage aggregators (e.g., Twilio) that have deals with operators throughout the world. These aggregators are responsible for delivering the SMS message to users and offer different quality and cost attributes. Quality, in this context, is the likelihood that a message will be successfully delivered to a user. A key decision faced by the business is identifying the best aggregator to route these messages. However, a significant challenge here is the nonstationarity in aggregators’ quality, which varies substantially over time. This necessitates a balanced exploration-exploitation approach, where we need to learn the best aggregator at any given time and maximize the number of messages we route through them. Multi-armed bandits (MAB) are a natural framework to formulate this problem. However, existing multi-armed bandit literature largely focuses on optimizing a single objective function and cannot be easily generalizable to the setting where we have multiple metrics, like quality and cost.

Motivated by the above problem, in this paper, we propose a novel variant of the MAB problem, which factors costs associated with playing an arm and introduces new metrics that uniquely capture the features of multiple real-world applications. We argue about the hardness* *of this problem by establishing fundamental limits on the performance of any online algorithm. We further show that naive generalization of existing algorithms performs poorly from a theoretical and empirical standpoint. Lastly, we propose a simple algorithm that balances two asymmetric objectives and achieves near-optimal performance.

## How it works:

In traditional (stochastic) multi-armed bandit problem, the learning agent has access to a set of K actions (arms) with unknown but fixed reward distributions and has to repeatedly select an arm to maximize the cumulative reward. Here, the challenge is developing a policy that balances the tension between acquiring information about actions with little historical observations and exploiting the most rewarding arm based on existing information. Regret metric is typically used to measure the effectiveness of any such adaptive policy. Briefly, regret measures the cumulative difference between expected reward from the best action, had we known the true reward distributions with the expected reward from the action preferred by the policy. Existing literature has extensively studied this setting, leading to simple but extremely effective algorithms like Upper Confidence Bound (UCB) and Thompson Sampling (TS), which have been further generalized and applied to a wide range of application domains. The central focus of these algorithms is incentivizing sufficient exploration of actions that appear promising. In particular, these approaches ensure that the best action always has a chance to get out of situations where the expected reward function is underestimated due to unfavorable randomness. However, there is also a fundamental limitation on how well an online learning algorithm can perform in general settings. In situations where the number of decision epochs is small and/or there are many actions with reward distributions similar to the optimal action, it becomes hard to effectively learn the optimal action. Mathematically speaking, in the traditional problem, it has been established that any online learning algorithm must incur a regret of Ω(KT) where K is the number of arms and T is the number of decision epochs.

In this work, we generalize the MAB framework to the multiobjective setting. In particular, we consider the problem setting where the learning agent has balanced the traditional exploration-exploitation trade-offs and trade-offs associated with multiple objectives, for example reward and cost. Keeping the SMS application in mind, to manage costs, we allow the agent to be agnostic between actions, whose expected reward (quality) is greater than 1−α fraction of the highest expected reward (quality). We refer to α as the subsidy factor and assume it is a known parameter specified by the problem domain. The agent’s objective is to explore various actions and identify the cheapest arm among these high-quality arms as frequently as possible. To measure the performance of any policy, we define two notions of regret, *q*uality regret and *c*ost regret. Quality regret is defined as the cumulative difference between α-adjusted expected reward from the highest quality action and the expected reward from the action preferred by the policy. Similarly, *c*ost regret is defined as the cumulative difference between the cost of the action preferred by our policy and the cost of the cheapest feasible arms, had the qualities and costs been known beforehand.

For this problem, we show that naively extending existing algorithms, like TS, will perform poorly on the cost regret.* *In particular, we consider the variant of TS, where we form the feasible set based on the corresponding estimates on quality and pick the cheapest feasible action. For this variant, we can show that TS performs arbitrarily worse (i.e., incurs linear regret). This is primarily because existing algorithms are designed to incentivize exploration of promising actions and can end up incurring large *cost regret* in settings where there are two actions with similar rewards but vastly different costs. We then establish a fundamental lower bound of Ω(K1/3T2/3) on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem. Building on these insights, we develop a simple algorithm based on the explore-then-commit* *idea, which balances the tension between two asymmetric objectives and achieves near-optimal performance up to logarithmic factors. We also demonstrate the superior performance of our algorithm through numerical simulations:

## Why it matters:

Multi-armed Bandits (MAB) is the most popular paradigm to balance the exploration-exploitation trade-off that is intrinsic to online decision making under uncertainty. They are applied to a wide range of application domains, including drug trials and online experiments, to ensure that maximal number samples are offered to the most promising candidate. Similar trade-offs are typical in recommendation systems, where the possible options to recommend and user preferences are constantly evolving. Facebook also has leveraged the MAB framework to improve various products including identifying the ideal video bandwidth allocation for users and best aggregator for sending authentication messages. Though one can model this in the traditional MAB framework by considering cost subtracted from the reward as the modified objective, such a modification is not always meaningful, particularly in settings where the reward and cost associated with an action represent different quantities (for example, quality and cost of an aggregator). In such problems, it is natural for the learning agent to optimize for both the metrics, typically avoiding incurring exorbitant costs for a marginal increase in cumulative reward. To the best of our knowledge, this paper takes a first step in generalizing the multi-armed bandit framework to problems with two metrics and presents both fundamental theoretical performance limits as well as easy-to-implement algorithms to balance the multiobjective trade-offs.

Lastly, we perform extensive simulations to understand various regimes of the problem parameters and compare different algorithms. More specifically, we consider scenarios where naive generalizations of UCB and TS, which have been adapted in real-life implementations, perform well and settings where they perform poorly, which is of interest to practitioners.