Facebook relies on a large suite of backend systems to serve billions of people each day. Many of these systems have a large number of internal parameters. For example, the web servers that power Facebook use the HipHop Virtual Machine (HHVM) to serve requests, and HHVM has dozens of parameters that control the just-in-time compiler. As another example, machine learning systems are used for a variety of prediction tasks. These systems typically involve multiple layers of predictive models, with a large number of parameters for determining how the models are linked together to yield a final recommendation. Such parameters must be carefully tuned through the use of live, randomized experiments, otherwise known as A/B tests. Each of these experiments may take a week or longer, and so the challenge is to optimize a set of parameters with as few experiments as possible.

A/B tests are often used as one-shot experiments for improving a product. In our paper Constrained Bayesian Optimization with Noisy Experiments, now in press at the journal *Bayesian Analysis*, we describe how we use an AI technique called Bayesian optimization to adaptively design rounds of A/B tests based on the results of prior tests. Compared to a grid search or manual tuning, Bayesian optimization allows us to jointly tune more parameters with fewer experiments and find better values. We have used these techniques for dozens of parameter tuning experiments across a range of backend systems, and have found that it is especially effective at tuning machine learning systems.

## Bayesian optimization for A/B testing

A typical approach for tuning parameters of an online system is to manually run small grid searches to optimize each parameter separately. Bayesian optimization constructs a statistical model of the relationship between the parameters and the online outcomes of interest, and uses that model to decide which experiments to run. This model-based approach has several key advantages, especially for tuning online machine learning systems.

*Better scaling with parameter dimensionality*: Given the limitations on the number of online experiments that can be run, grid search or manual tuning cannot be used to tune more than a couple parameters simultaneously. The use of models allows Bayesian optimization to scale to a much larger number of parameters; we routinely optimize up to 20 parameters jointly. This is important for machine learning systems where there are frequently interactions between parameters that necessitate joint optimization.

*Reduced number of experiments*: Bayesian optimization allows us to borrow strength across rounds of experimentation: A test of a parameter value in a continuous space gives us information about not only its outcome, but also those of nearby points. This allows us to greatly reduce the number of experiments needed to explore the space.

*Better experiment outcomes*: The model is able to identify parts of the space that are unlikely to provide a good outcome, and avoid running tests of those parameter values. This improves the experience inside the treatment groups.

*Understand the parameter space*: Modeling allows us to visualize and better understand how the parameters affect the outcome of interest. This plot shows a 2-d slice of an 8 parameter experiment, and is typical of the visualizations we make to better understand the parameter space:

We now provide a high-level description of Bayesian optimization, and then describe the work from the paper and some of the experimental results given there.

## Bayesian optimization

Bayesian optimization is a technique for solving optimization problems where the objective function (i.e., the online metric of interest) does not have an analytic expression, rather it can only be evaluated through some time consuming operation (i.e., a randomized experiment). The key to efficiently exploring a multidimensional space with a limited number of experiments is modeling: The true objective function is unknown, but we can fit a model to the observations we have so far and use the model to predict good parts of the parameter space where we should run additional experiments.

The Gaussian process (GP) is a nonparametric Bayesian model that works well for Bayesian optimization because it provides excellent uncertainty estimates and is analytically tractable. It provides an estimate of how an online metric varies with the parameters of interest, as illustrated here:

Each data marker in the figure above corresponds to the outcome of an A/B test of that parameter value. We can use the GP to decide which parameter to test next by balancing exploration (high uncertainty) with exploitation (good model estimate). This is done by computing an acquisition function that estimates the value of running an experiment with any given parameter value.

Suppose we are trying to decide if we should run an experiment with parameter configuration *x*. How valuable is an observation at *x*? The answer to this question depends on our utility function. Suppose we plan to end the optimization after observing *x*, and that the utility of the optimization is just the value of the best point that it found. In this case, the utility of observing *x* is how much its value *f(x)* improves over that of the current-best, which we will call *x* *(assume that we are maximizing):

* I(x) = max(0, f(x) – f(x*)).*

*I(x)* is a random variable inasmuch as *f(x)* is a random variable, but the posterior of *f(x)* is known analytically from the GP model. The expected improvement (EI) acquisition function chooses the point *x* that maximizes the expected value of *I(x)*, under the GP posterior. EI is a popular acquisition function because this expectation has an analytic form that is easy to compute, and it performs well in practice. This figure shows the EI for the model fit above:

The parameter that maximizes EI (red dashed line) is tested in the next experiment. The model is then updated with the results of that experiment, and EI is recomputed to select a new candidate. This loop is repeated until the search is finished, as shown for several iterations in the animation above.

The GP assumes that the response surface is smooth, which enables us to borrow strength from nearby observations in the parameter space. It allows us to be confident that we have thoroughly explored the space without having to actually test every possible parameter value.

## Our approach to Bayesian optimization with online experiments

There are several challenges with using Bayesian optimization for tuning online systems that motivated the work described in the paper.

*Noise*: Noise levels are often quite high in the observations made via randomized experiments, especially compared to typical machine learning applications of Bayesian optimization like hyperparameter tuning.

*Constraints*: In addition to the metric to optimize, there are also always constraints that must be satisfied. These constraints are often noisy metrics themselves, and so we must consider their posterior distribution of feasibility.

*Batch experimentation*: Because online experiments are so time consuming, they are typically not run in a linear sequence, as shown in the animation above, rather they are run in large batches of parallel tests. In our experiments we frequently evaluate up to 50 configurations at a time, and so require a Bayesian optimization procedure that can return large batches of points to be evaluated.

There are extensions to EI to handle each of these issues, however noise has typically been handled with heuristics that we show perform poorly when noise levels reach those typical of A/B tests. With observation noise, we now have uncertainty not only in the value *f(x)*, but we also have uncertainty in which observation is the current best, *x**, and its value, *f(x*)*. A common approach for dealing with observation noise has been to ignore it and use a plug-in estimator, by substituting the GP mean estimate of *f(x*)*.

In the paper we describe a Bayesian approach to handling observation noise in which we include the posterior uncertainty induced by noise in EI’s expectation. That is, instead of computing the expectation of *I(x)* under the posterior of *f(x)*, we compute it under the joint posterior of *f(x)* and *f(x*)*. This expectation no longer has a closed form like EI, however we can easily draw samples of the values at past observations *f(x_1), …, f(x_n)* from the GP posterior, and the conditional distribution *f(x) | f(x_1), …, f(x_n)* has closed form. The expectation is thus amenable to Monte Carlo approximation, where we sample *f(x_1), …, f(x_n)* and then take the conditional expectation *E[I(x) | f(x_1), …, f(x_n)]* in closed form. We show in the paper how quasi-Monte Carlo integration can provide good estimates of this expectation that can be efficiently optimized.

## Results of Bayesian optimizations at Facebook

We have used the approach described in the paper to optimize a number of systems at Facebook, and describe two such optimizations in the paper. The first was to optimize 6 parameters of one of Facebook’s ranking systems. These particular parameters were involved in the indexer, which aggregates content to be sent to the prediction models. The second example was to optimize 7 numeric compiler flags for HHVM. The goal of this optimization was to reduce CPU usage on the web servers, with a constraint on not increasing peak memory usage. This figure from the paper shows the CPU usage of each configuration tested (100 total), along with the probability that each point satisfied the memory constraint:

The first 30 iterations were randomly generated configurations (up to the green line), after which Bayesian optimization was used to identify parameter configurations to be evaluated. In this experiment, and many others that we’ve run, we found that Bayesian optimization was able to find better configurations that are more likely to satisfy the constraints. The compiler flag settings found in this optimization were made the new defaults in the open source HHVM.

We have found that with our approach described in the paper, Bayesian optimization is an effective and robust tool for optimizing via noisy experiments. It has made engineers’ lives easier by replacing manual exploration of multidimensional spaces, and has improved the online performance of backend systems and machine learning infrastructure.

——

Interested in working at the frontier of applied AI and causal inference? We’re hiring!