June 2, 2020

Protecting privacy in Facebook mobility data during the COVID-19 response

By: Amaç Herdağdelen, Alex Dow, Bogdan State, Payman Mohassel, Alex Pompe

Background

On April 6, 2020, Facebook Data for Good released three additions to our Disease Prevention Maps product to help health researchers and NGOs respond to the COVID-19 crisis. The researchers and organizations using these maps range from educational institutions, such as the Harvard T.H. Chan School of Public Health in the United States, National Tsing Hua University in Taiwan, and the University of Venice in Italy, to nonprofits and other organizations such as Direct Relief, the Bill & Melinda Gates Foundation, and the World Bank. A recent op-ed in Science, signed by doctors, epidemiologists, disease modeling experts, and data privacy scholars, underscored the need for this kind of data.

We are now releasing Facebook Movement Range Maps publicly. The Facebook Movement Range data is available in visualized format for 14 countries, and in CSV format for a longer list of countries on the United Nations Office for the Coordination of Humanitarian Affairs’ Humanitarian Data Exchange.

To preserve privacy in these public data sets, we applied a new differential privacy (DP) framework. Differential privacy limits the risk that individual data can be reidentified, even with the help of additional information, including information we cannot anticipate now. A DP framework takes the sensitivity of the data set into account and adds a proportional amount of noise, so that with high probability no one can reidentify users.

Here are the details of how we generate the Facebook Movement Range data and how we use differential privacy to protect user privacy.

Facebook Movement Range data

These data sets are intended to inform researchers and public health experts about how populations are responding to physical distancing measures. In particular, there are two metrics, Change in Movement and Stay Put, that provide slightly different perspectives on movement trends. Change in Movement looks at how much people are moving around and compares it with a baseline period that predates most social distancing measures, while Stay Put looks at the fraction of the population that appears to stay within a small area during an entire day.

Where does the data come from, and who is included?

People who use Facebook on a mobile device have the option of providing their precise location in order to enable products like Nearby Friends and Find Wi-Fi and to get local content and ads. Movement Range Trends are produced by aggregating and de-identifying this data. Only people who opt in to Location History and background location collection are included. People with very few location pings in a day are not informative for these trends, and, therefore, we include only those people whose location is observed for a meaningful period of the day.

Each metric in this data set is produced for a given administrative region once per day. The regions we use are comparable to counties in the United States. Specifically, beyond U.S. counties, this includes level 3 statistical regions from the Nomenclature of Territorial Units for Statistics (NUTS) for European countries, and level 2 divisions from the Database of Global Administrative Areas (GADM) for other countries around the world. Some conflict areas, disputed territories, and countries where Facebook does not operate are omitted from the data sets. Each data point corresponds to a full day and night, from 8:00 p.m. one day to 7:59 p.m. the next day in local time.
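The day window described above can be sketched in a few lines. This is a minimal illustration rather than the production logic: the function name data_point_date is hypothetical, and labeling each window by the calendar date on which it ends is an assumption, since the text does not say which date a window is assigned to.

```python
from datetime import date, datetime, timedelta


def data_point_date(local_ts: datetime) -> date:
    """Map a local timestamp to the 8:00 p.m.-7:59 p.m. window containing it.

    Assumption (not stated in the post): a window is labeled by the
    calendar date on which it ends.
    """
    # Shifting by 4 hours moves 8:00 p.m. to midnight, so the calendar
    # date of the shifted timestamp identifies the window.
    return (local_ts + timedelta(hours=4)).date()
```

Under this assumption, a ping at 8:00 p.m. on March 2 and a ping at 7:59 p.m. on March 3 both belong to the March 3 data point.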

To generate a data point for a given region, we aggregate the locations of users who spend evenings there. After mapping people to a region, we ensure that there is enough data to produce meaningful trends and to protect the privacy of individuals. Any region with fewer than 300 qualifying people is omitted from the data sets.

Calculating the Change in Movement metric

The idea behind Change in Movement is to understand how much less people are moving around since the onset of the coronavirus epidemic. We quantify how much people move around by counting the number of level-16 Bing tiles (which are approximately 600 meters by 600 meters in area at the equator) they are seen in within a day. People seen in more tiles are probably moving around more, while people seen in fewer are probably moving around less. Each day we take all the eligible people in a given region and compute the number of distinct tiles they were seen in. More precisely, let’s say that U_{d,r} is the set of eligible users in region r on day d, and tiles(u) is the number of tiles visited by a given user u in U_{d,r}.
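As a rough illustration of tiles(u), the sketch below maps location pings to level-16 tiles using the publicly documented Bing Maps tile system formulas and counts the distinct tiles. The function names and the (latitude, longitude) input format are assumptions made for this example; the text does not describe how pings are stored or processed.

```python
import math
from typing import Iterable, Tuple


def latlon_to_tile(lat: float, lon: float, level: int = 16) -> Tuple[int, int]:
    """Return the (tile_x, tile_y) of the Bing tile containing a point."""
    # The Bing Maps tile system clips coordinates to the Web Mercator range.
    lat = max(-85.05112878, min(85.05112878, lat))
    lon = max(-180.0, min(180.0, lon))
    sin_lat = math.sin(math.radians(lat))
    x = (lon + 180.0) / 360.0
    y = 0.5 - math.log((1.0 + sin_lat) / (1.0 - sin_lat)) / (4.0 * math.pi)
    map_size = 256 << level  # map width and height in pixels at this level
    pixel_x = int(min(max(x * map_size + 0.5, 0), map_size - 1))
    pixel_y = int(min(max(y * map_size + 0.5, 0), map_size - 1))
    return pixel_x // 256, pixel_y // 256  # each tile is 256 x 256 pixels


def tiles_visited(pings: Iterable[Tuple[float, float]]) -> int:
    """Count the distinct level-16 tiles among (lat, lon) pings."""
    return len({latlon_to_tile(lat, lon) for lat, lon in pings})
```

Repeated pings inside the same tile count once, so tiles_visited measures how widely someone moved rather than how often their location was recorded.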

To prevent extremely active people from skewing the data and to limit the amount of noise we need to add for differential privacy (discussed in detail below), we then “clip” each tiles(u) value at a maximum of 200 tiles. This means a user who is seen in more than 200 tiles will contribute only 200 tiles to the total for a region. We now sum the resulting clipped values to get the total number of tiles visited for that region.

total_tiles(U_{d,r}) = \sum_{u \in U_{d,r}} \min(tiles(u), 200)
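The clipped sum is straightforward to express in code. The sketch below assumes the per-user tile counts for one region and day are already available as a list; the sample values are made up for illustration.

```python
MAX_TILES = 200  # per-user cap stated in the text


def total_tiles(tile_counts):
    """Sum per-user tile counts after clipping each one at MAX_TILES."""
    return sum(min(t, MAX_TILES) for t in tile_counts)


counts = [3, 12, 450, 1]  # made-up tiles(u) values for one region and day
print(total_tiles(counts))  # 3 + 12 + 200 + 1 = 216
```

The user seen in 450 tiles contributes only 200, which is what bounds the effect any one person can have on the total.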

At this point, we employ a differential privacy framework to protect privacy and provide a mathematical limit on the risk that an individual can be reidentified from the resulting data. To do this, we calculate an appropriate amount of noise to add to the total tiles visited for each region. The amount of noise is related to the sensitivity of the data — that is, the maximum effect that removing one person from the data set could have on the result. In the case of total tiles visited, the sensitivity is equal to the most tiles visited by any individual, which we’ve capped at 200, but which can be smaller. We’ll denote this sensitivity value as F.

Another important value for the calculation of differential privacy noise is the parameter referred to as epsilon (ε), which is meant to control the level of additional privacy protection reached by the addition of noise. A smaller epsilon means that it is harder to reidentify an individual in a data set and privacy is more protected. We use an epsilon value of 1.0 for the data sets we have released.
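The standard definition of ε-differential privacy makes the role of epsilon concrete. The symbols M, D, D', and S are notation introduced for this note and are not used elsewhere in this post: a randomized mechanism M satisfies ε-differential privacy if, for any two data sets D and D' that differ in one person's data and any set of possible outputs S,

```latex
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S]
```

With ε = 1.0, the probability of any outcome changes by at most a factor of e ≈ 2.72 when one person's data is added or removed.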

Finally, following the work of Dwork and Roth (2014), we generate the noise by drawing from a Laplace distribution, Laplace(μ,b), which takes two parameters: a location parameter (μ) and a scale parameter (b), sometimes called the diversity. We set μ=0 and b=F/ε, because adding noise from this distribution to a quantity with sensitivity F satisfies ε-differential privacy. This noise is added to the total tiles visited to get a noisy sum:

total_tiles’(U_{d,r}) = total_tiles(U_{d,r}) + Laplace(0,F/ε)
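The noise step can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the production implementation: the function names are hypothetical, the defaults use the worst-case sensitivity F = 200 and ε = 1.0 from the text, and Python's general-purpose random generator is not suitable for a real privacy release.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with location 0 and that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_total_tiles(total, sensitivity=200.0, epsilon=1.0, rng=random):
    """Add Laplace(0, F/epsilon) noise to an already clipped tile total."""
    return total + laplace_noise(sensitivity / epsilon, rng)


# Illustrative use with a made-up clipped total and a seeded generator.
print(noisy_total_tiles(216_000, rng=random.Random(0)))
```

A smaller epsilon or a larger sensitivity widens the distribution, which is why clipping at 200 tiles keeps the added noise manageable.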

We now divide by the number of eligible people in the region, |U_{d,r}|, to get the noisy average number of tiles visited.

avg_tiles’(U_{d,r}) = total_tiles’(U_{d,r}) / |U_{d,r}|
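A short worked calculation shows how much the noise can perturb this average. Dividing Laplace(0, F/ε) noise by the number of eligible people rescales it by the same factor, so with the worst-case sensitivity F = 200, ε = 1.0, and the minimum region size of 300 people, the scale of the noise in the average is at most:

```latex
\frac{F/\varepsilon}{|U_{d,r}|} \le \frac{200 / 1.0}{300} \approx 0.67 \text{ tiles}
```

The standard deviation of a Laplace variable is \sqrt{2} times its scale, so the noise in the average has a standard deviation of roughly 0.94 tiles in this worst case, and it shrinks as the number of eligible people grows.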

Now we can compute the (noisy) average number of tiles visited for a region for any given day. (Note: Facebook has data retention policies that limit how far back we can look.) Since we want to understand how movement has changed over time, we establish a baseline period for comparison. For most places in the world, we are using four weeks of February, from the 2nd to the 29th, as a baseline period. Because Italy instituted social distancing interventions earlier than the other countries for which we are producing data, we limit the Italy baseline to only the first two weeks of February. Finally, for the United States, we omit February 17 from the baseline, because it corresponds to the Presidents’ Day holiday and exhibits unusual levels of mobility.

Given the baseline period, we compute a separate baseline value for each day of the week by averaging avg_tiles’ over every occurrence of that day of the week in the period: baseline_avg_tiles’_{r, day_of_week}. For every day following the baseline period, we compute the Change in Movement as the relative difference between avg_tiles’ on that day and the baseline:

(avg_tiles’(U_{d,r}) - baseline_avg_tiles’_{r, day_of_week(d)}) / baseline_avg_tiles’_{r, day_of_week(d)}

where day_of_week(d) specifies the day of the week for d.
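The baseline and the relative change can be sketched as follows. The function names and the dictionary keyed by date are assumptions made for this example, and the values are made up; the sketch covers a single region.

```python
from collections import defaultdict
from datetime import date


def weekday_baselines(avg_tiles_by_date, baseline_dates):
    """Average the noisy daily values separately for each day of the week."""
    by_weekday = defaultdict(list)
    for d in baseline_dates:
        by_weekday[d.weekday()].append(avg_tiles_by_date[d])
    return {wd: sum(vals) / len(vals) for wd, vals in by_weekday.items()}


def change_in_movement(avg_tiles_by_date, baselines, d):
    """Relative change of day d from the baseline for its day of the week."""
    base = baselines[d.weekday()]
    return (avg_tiles_by_date[d] - base) / base


# Made-up values: two baseline Mondays and one later Monday.
avg = {date(2020, 2, 3): 10.0, date(2020, 2, 10): 12.0, date(2020, 3, 30): 5.5}
base = weekday_baselines(avg, [date(2020, 2, 3), date(2020, 2, 10)])
print(change_in_movement(avg, base, date(2020, 3, 30)))  # (5.5 - 11.0) / 11.0 = -0.5
```

Comparing each day only with the same day of the week keeps ordinary weekday and weekend differences from showing up as a change in movement.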

Calculating the Stay Put metric

While Change in Movement shows how people have reduced the amount they are moving around, we also wanted to understand how many people were generally staying near or at home. Our Stay Put metric measures this by calculating the percentage of eligible people who are observed in only a single level-16 Bing tile during the course of a day. Continuing with the notation used above, let’s say that U_{d,r} is the set of eligible users in region r on day d, and tiles(u) is the number of tiles visited by a given user u in U_{d,r}. Thus, we can compute the number of users who stayed put in region r on day d as:

num_stayput(U_{d,r}) = \sum_{u \in U_{d,r}} ifelse(tiles(u) = 1, 1, 0)

As above, we want to employ a differential privacy framework to add noise to this count. Including or removing a single person can change this count by at most 1, so the sensitivity value is F = 1. Again, we add noise drawn from a Laplace distribution with μ=0 and b=F/ε. This noise is added to the “stay put” count to get a noisy count:

num_stayput’(U_{d,r}) = num_stayput(U_{d,r}) + Laplace(0,1)

Again, we use 1.0 as our epsilon value for calculating this metric. Since a single user will contribute to both Change in Movement and Stay Put, we must sum these two epsilon values together, resulting in a total differential privacy epsilon value of 2.0.
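The summation follows from the sequential composition property of differential privacy: releasing the outputs of two mechanisms that are ε_1- and ε_2-differentially private on the same data is (ε_1 + ε_2)-differentially private. Writing ε_move and ε_stay for the budgets of the two metrics (labels introduced here for illustration):

```latex
\varepsilon_{\text{total}} = \varepsilon_{\text{move}} + \varepsilon_{\text{stay}} = 1.0 + 1.0 = 2.0
```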

Now that we have a noisy count of the number of people that were present in only a single tile, we divide that by the total number of eligible people to get the Stay Put fraction for a region on a given day:

frac_stayput’(U_{d,r}) = num_stayput’(U_{d,r}) / |U_{d,r}|
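The Stay Put steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input is a list of per-user tile counts for one region and day, and the standard-library random generator stands in for whatever sampler is used in production.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one Laplace(0, scale) sample as a difference of exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_stay_put_fraction(tile_counts, epsilon=1.0, rng=random):
    """Noisy fraction of users seen in exactly one tile (sensitivity F = 1)."""
    num_stayput = sum(1 for t in tile_counts if t == 1)
    noisy_count = num_stayput + laplace_noise(1.0 / epsilon, rng)
    return noisy_count / len(tile_counts)


# Made-up region: 150 people seen in one tile, 150 seen in five tiles.
counts = [1] * 150 + [5] * 150
print(noisy_stay_put_fraction(counts, rng=random.Random(1)))
```

Because the noise is added before dividing, the result can fall slightly below 0 or above 1 for regions where nearly everyone or no one stayed put; the text does not say whether such values are adjusted.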

Results

Figure: Movement Range trends with seven-day rolling mean smoothing, ending on May 16.

The Change in Movement data has been used around the world to assess nonpharmaceutical interventions and other policies meant to reduce the rate of coronavirus transmission. Change in Movement trendlines like those above have become part of numerous daily situational reports used by policymakers. For example, California Gov. Gavin Newsom specifically referenced this data in a recent press conference providing updates on the state’s COVID-19 response.

The Staying Put Percentage data has also been extremely useful for COVID-19 response efforts. As seen above, this data clearly shows the effect across Europe of shelter-in-place orders since early March 2020 and their loosening in May 2020. The COVID-19 Mobility Data Network is composed of infectious disease epidemiologists at universities around the world using aggregated mobility data to support the COVID-19 response of governments. They also provide a visualization that includes the Staying Put Percentage updated daily to guide decision-makers around the world.

