Viet-An Nguyen, Peibei Shi, Jagdish Ramakrishnan, and Udi Weinsberg are Research Scientists at Facebook working within Core Data Science, a research and development team focused on improving Facebook’s processes, infrastructure, and products.
What we did
Online services have made great strides in leveraging machine-learned models to fight abuse at scale. For example, 99.5 percent of fake-account takedowns on Facebook are the result of proactive detection, before users report them. However, despite this progress, there are many areas where large-scale systems still rely on human decisions for a range of tasks, including collecting labels for training models, enforcing a range of policies, and reviewing appeals.
An obvious challenge that arises when relying on human reviewers is that humans are inherently noisy and potentially biased decision-makers. While bias is a trait of the individual, noise can result from subjectivity or ambiguity in the decision guidelines, or from simple mistakes, commonly the result of fatigue or pressure. In this work, we consider three applications where mistakes in human decisions can have negative outcomes:
- Enforcement: When community standards are being enforced, an incorrect decision can result in taking down a benign piece of content from the platform or leaving violating content on the platform.
- Training machine learning models: Using inaccurate human-generated “ground truth” labels might lead to inaccurate models.
- Prevalence estimation: Prevalence is the percentage of policy-violating content out of all content seen by Facebook users. It is computed by sampling content and sending it to reviewers, who review it for violations. Failing to consider mistakes in these reviews can lead to incorrect prevalence estimates and confidence intervals.
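To see why ignoring review mistakes biases prevalence, consider the classical Rogan-Gladen correction, which adjusts an observed rate using reviewer sensitivity and specificity. This is only a simple illustration of the bias, not CLARA's Bayesian treatment, and the numbers below are made up:

```python
# Rogan-Gladen style correction, shown only to illustrate how review
# mistakes bias a naive prevalence estimate. Sensitivity = P(flagging a
# true violation); specificity = P(passing a truly benign item).
def corrected_prevalence(observed_rate, sensitivity, specificity):
    return (observed_rate + specificity - 1.0) / (sensitivity + specificity - 1.0)

# Reviews flag 6% of sampled content, but reviewers miss some violations
# and incorrectly flag some benign items (illustrative rates).
naive = 0.06
corrected = corrected_prevalence(naive, sensitivity=0.9, specificity=0.97)
```

With these illustrative error rates, the corrected estimate differs substantially from the naive 6 percent, which is exactly the gap that motivates modeling reviewer mistakes directly.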
A scalable solution to reduce the likelihood of mistakes is to assign multiple reviewers to each task. Where available, human decisions can also be augmented with additional nonhuman signals, such as scores from machine learning models. A key challenge that arises in these settings is the need to aggregate multiple potentially conflicting decisions and provide an estimate of the certainty of the decision.
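As a toy illustration of the aggregation problem (not CLARA itself), the following sketch combines conflicting binary decisions from reviewers with known accuracies into a posterior confidence, assuming independent reviewers and a prevalence prior; all numbers are made up:

```python
# Toy Bayesian aggregation of conflicting binary decisions.
# Reviewer accuracies and the prevalence prior are illustrative values.

def aggregate(decisions, accuracies, prevalence):
    """Posterior P(violating | decisions), assuming independent reviewers.

    decisions:  list of 0/1 labels (1 = violating)
    accuracies: list of P(correct) for each reviewer
    prevalence: prior P(violating)
    """
    p_viol, p_ok = prevalence, 1.0 - prevalence
    for d, a in zip(decisions, accuracies):
        # Likelihood of this decision under each hypothesis.
        p_viol *= a if d == 1 else 1.0 - a
        p_ok   *= a if d == 0 else 1.0 - a
    return p_viol / (p_viol + p_ok)

# Two reviewers say "violating," one says "benign."
confidence = aggregate([1, 1, 0], [0.9, 0.8, 0.7], prevalence=0.1)
```

Note how the low prior pulls the posterior well below what a simple 2-of-3 majority vote would suggest; CLARA generalizes this idea by also learning the accuracies and the prior from the data.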
In our paper, to be published at the 2020 ACM Conference on Knowledge Discovery and Data Mining, we present CLARA (Confidence of Labels and Raters), a system built and deployed at Facebook to estimate the uncertainty in human-generated decisions. We show how CLARA is used at Facebook to obtain more accurate decisions overall while reducing operational resource use.
How we did it
We follow a rich body of research on crowdsourcing and take a Bayesian probabilistic approach to define different latent variables and the generative process of the observed data. In particular, the observed data includes a set of items, each of which receives multiple labels and potentially one or more scores from machine learning models. CLARA estimates the following latent variables:
- Overall prevalence: The rate at which each label category occurs
- Per-reviewer confusion matrix: Each reviewer’s ability to correctly label items of different true label categories
- Per-item true label: The true latent label category of each item
- Score mixture: The different score distributions of items from different label categories
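One way to read the model is through its generative story. The sketch below draws data the way these latent variables describe, for the binary case and with illustrative parameter values (in practice, CLARA infers these quantities rather than assuming them):

```python
import random

random.seed(0)

# Illustrative parameters only; CLARA infers these from observed data.
prevalence = 0.2                      # rate of the "violating" category
confusion = {                         # P(reviewer is correct | true label)
    "r1": {1: 0.9, 0: 0.85},
    "r2": {1: 0.7, 0: 0.95},
}
score_mean = {1: 0.8, 0: 0.2}         # ML score mixture components

def generate_item():
    # Draw the latent true label from the prevalence.
    true_label = 1 if random.random() < prevalence else 0
    # Draw an ML score from the component for that label, clipped to [0, 1].
    score = min(1.0, max(0.0, random.gauss(score_mean[true_label], 0.1)))
    # Each reviewer flips the true label according to their confusion matrix.
    labels = {r: (true_label if random.random() < confusion[r][true_label]
                  else 1 - true_label)
              for r in confusion}
    return true_label, score, labels

true_label, score, labels = generate_item()
```

Inference then runs this story in reverse: given only the scores and reviewer labels, recover the prevalence, confusion matrices, true labels, and score mixture.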
For posterior inference, we implemented a collapsed Gibbs sampling algorithm to infer the values of all latent variables given the observed data. Figure 1 shows the graphical model of CLARA together with illustrative examples of the observed and latent variables.
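To make the inference step concrete, here is a simplified collapsed Gibbs update for the per-item true labels in the binary case, with reviewer confusion rows and the prevalence collapsed out under symmetric Beta priors. This is a sketch of the general technique, not CLARA's production sampler, and it omits the score mixture:

```python
import random

random.seed(1)

ALPHA = 1.0   # symmetric prior on prevalence (illustrative)
BETA = 1.0    # symmetric prior on each reviewer's confusion rows

def resample_item(i, z, labels, n_label, n_conf):
    """One collapsed Gibbs update of item i's latent true label z[i]."""
    # Remove item i's current assignment from the sufficient statistics.
    old = z[i]
    n_label[old] -= 1
    for r, obs in labels[i].items():
        n_conf[r][old][obs] -= 1
    # Conditional weight of each candidate true label given everything else.
    weights = []
    for k in (0, 1):
        w = n_label[k] + ALPHA
        for r, obs in labels[i].items():
            row = n_conf[r][k]  # counts of reviewer r's labels when true = k
            w *= (row[obs] + BETA) / (row[0] + row[1] + 2 * BETA)
        weights.append(w)
    # Sample the new assignment and restore the statistics.
    new = 0 if random.random() < weights[0] / sum(weights) else 1
    z[i] = new
    n_label[new] += 1
    for r, obs in labels[i].items():
        n_conf[r][new][obs] += 1

# Tiny example: three items, two reviewers, random initialization.
labels = [{"r1": 1, "r2": 1}, {"r1": 0, "r2": 0}, {"r1": 1, "r2": 0}]
z = [random.choice((0, 1)) for _ in labels]
n_label = [z.count(0), z.count(1)]
n_conf = {r: [[0, 0], [0, 0]] for r in ("r1", "r2")}
for i, item in enumerate(labels):
    for r, obs in item.items():
        n_conf[r][z[i]][obs] += 1

for _ in range(100):  # Gibbs sweeps
    for i in range(len(labels)):
        resample_item(i, z, labels, n_label, n_conf)
```

Averaging each item's sampled assignments across sweeps yields the posterior label distribution, which is where the per-item confidence comes from.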
We’ve deployed CLARA at scale at Facebook. While similar underlying models have been studied in the literature, this work provides the details of a large-scale, real-world deployment of a complete system, with both offline and online aggregation and uncertainty estimation capabilities.
Figure 2 illustrates an overview of how CLARA is deployed at scale in production at Facebook.
One of the key applications of CLARA at Facebook is the efficient allocation of labeling resources based on confidence scores. We achieve this by obtaining additional reviews only when the decision confidence given by CLARA is not sufficiently high. This results in a cost/accuracy trade-off, where higher levels of decision confidence require additional reviews. An example trade-off curve, which uses simulated “ground truth” and labeling mistakes, is shown in Figure 3. The figure depicts the change in accuracy (left) and mean absolute error (right) as a function of the percentage of labels collected. Compared to a random sampling baseline, the figure shows that CLARA provides a better trade-off curve, enabling an efficient use of labeling resources. In a production deployment, we found that CLARA can save up to 20 percent of total reviews compared to majority vote. You can find more details and results in our paper.
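The allocation policy can be sketched as a simple stopping rule: keep requesting reviews for an item until the aggregated confidence clears a threshold or a budget is exhausted. The threshold, budget, and single shared reviewer accuracy below are illustrative stand-ins, and the toy posterior assumes a uniform prior:

```python
import random

random.seed(2)

THRESHOLD = 0.95   # required decision confidence (illustrative)
BUDGET = 5         # maximum reviews per item (illustrative)
ACCURACY = 0.85    # assumed accuracy shared by all reviewers (illustrative)

def posterior(decisions, prior=0.5):
    """Toy posterior P(violating | decisions) under a uniform prior."""
    p1, p0 = prior, 1.0 - prior
    for d in decisions:
        p1 *= ACCURACY if d == 1 else 1.0 - ACCURACY
        p0 *= ACCURACY if d == 0 else 1.0 - ACCURACY
    return p1 / (p1 + p0)

def label_item(true_label):
    """Collect reviews until confident or out of budget; return decision, cost."""
    decisions = []
    while len(decisions) < BUDGET:
        # Simulate one noisy human review of the item.
        noisy = true_label if random.random() < ACCURACY else 1 - true_label
        decisions.append(noisy)
        p = posterior(decisions)
        if max(p, 1.0 - p) >= THRESHOLD:   # confident enough: stop early
            break
    return int(posterior(decisions) >= 0.5), len(decisions)

decision, cost = label_item(true_label=1)
```

Easy items where early reviews agree stop quickly, while contested items consume more of the budget, which is the source of the review savings described above.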
How we are extending this work
The current implementation of CLARA leverages machine learning scores by treating them as nonbinary “artificial reviewers.” However, we observe that human mistakes are often correlated with the difficulty of the task, which can be reflected in the machine learning score. We are developing continuous “confusion” and prevalence functions, which take into account the difficulty of the task as captured by the machine learning score.
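One illustrative form such a confusion function could take is an error probability that rises as the ML score approaches the decision boundary, i.e., as the item gets harder. The parametric shape and constants below are assumptions for illustration, not the model from the paper:

```python
import math

def error_rate(score, base=0.02, scale=10.0):
    """Hypothetical reviewer error rate as a function of ML score in [0, 1].

    Items scored near the 0.5 boundary are treated as hard (error approaches
    0.5, i.e., guessing); confidently scored items are easy (error near
    `base`). All constants are illustrative assumptions.
    """
    difficulty = 1.0 - 2.0 * abs(score - 0.5)   # 0 = easy, 1 = at boundary
    return base + (0.5 - base) / (1.0 + math.exp(-scale * (difficulty - 0.5)))
```

Replacing a reviewer's fixed confusion matrix with a function like this would let the model expect more mistakes exactly where the ML score signals ambiguity.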