Fast Database Restarts at Facebook

ACM Special Interest Group on Management of Data (SIGMOD)


Facebook engineers query multiple databases to monitor and analyze Facebook products and services. The fastest of these databases is Scuba, which achieves subsecond query response time by storing all of its data in memory across hundreds of servers. We are continually improving the code for Scuba and would like to push new software releases at least once a week. However, restarting a Scuba machine clears its memory. Recovering all of its data from disk — about 120 GB per machine — takes 2.5-3 hours to read and format the data per machine. Even 10 minutes is a long downtime for the critical applications that rely on Scuba, such as detecting user-facing errors. Restarting only 2% of the servers at a time mitigates the amount of unavailable data, but prolongs the restart duration to about 12 hours, during which users see only partial query results and one engineer needs to monitor the servers carefully. We need a faster, less engineer intensive, solution to enable frequent software upgrades.

In this paper, we show that using shared memory provides a simple, effective, fast, solution to upgrading servers. Our key observation is that we can decouple the memory lifetime from the process lifetime. When we shutdown a server for a planned upgrade, we know that the memory state is valid (unlike when a server shuts down unexpectedly). We can therefore use shared memory to preserve memory state from the old server process to the new process. Our solution does not increase the server memory footprint and allows recovery at memory speeds, about 2-3 minutes per server. This solution maximizes uptime and availability, which has led to much faster and more frequent rollouts of new features and improvements. Furthermore, this technique can be applied to the in-memory state of any database, even if the memory contains a cache of a much larger disk-resident data set, as in most databases.

Related Publications

All Publications

ISAAC - December 5, 2021

On the Extended TSP Problem

Julián Mestre, Sergey Pupyrev, Seeun William Umboh

ICML - July 24, 2021

Ditto: Fair and Robust Federated Learning Through Personalization

Tian Li, Shengyuan Hu, Ahmad Beirami, Virginia Smith

International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE) - September 26, 2021

Behavioural and Structural Imitation Models in Facebook’s WW Simulation System

John Ahlgren, Kinga Bojarczuk, Inna Dvortsova, Mark Harman, Rayan Hatout, Maria Lomeli, Erik Meijer, Silvia Sapora

ESEM - September 23, 2021

Measurement Challenges for Cyber Cyber Digital Twins: Experiences from the Deployment of Facebook’s WW Simulation System

Kinga Bojarczuk, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Maria Lomeli, Simon Mark Lucas, Erik Meijer, Rubmary Rojas, Silvia Sapora

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy