Publication

Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website

ACM SIGCOMM


Abstract

Modern network infrastructure has evolved into a complex organism to satisfy the performance and availability requirements for the billions of users. Frequent releases such as code upgrades, bug fixes and security updates havebecome a norm.Millions of globally distributed infrastructure components including servers and load-balancers are restarted frequently from multiple times per-day to per-week. However, every release brings possibilities of disruptions as it can result in reduced cluster capacity, disturb intricate interaction of the components operating at large scales and disrupt the end-users by terminating their connections. The challenge is further complicated by the scale and heterogeneity of supported services and protocols.

In this paper, we leverage different components of the end-to-end networking infrastructure to prevent or mask any disruptions in face of releases. Zero Downtime Release is a collection of mechanisms used at Facebook to shield the end-users from any disruptions, preserve the cluster capacity and robustness of the infrastructure when updates are released globally. Our evaluation shows that these mechanisms prevent any significant cluster capacity degradation when a considerable number of productions servers and proxies are restarted and minimizes the disruption for different services (notably TCP, HTTP and publish/subscribe).

Related Publications

All Publications

International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE) - September 26, 2021

Behavioural and Structural Imitation Models in Facebook’s WW Simulation System

John Ahlgren, Kinga Bojarczuk, Inna Dvortsova, Mark Harman, Rayan Hatout, Maria Lomeli, Erik Meijer, Silvia Sapora

ESEM - September 23, 2021

Measurement Challenges for Cyber Cyber Digital Twins: Experiences from the Deployment of Facebook’s WW Simulation System

Kinga Bojarczuk, Inna Dvortsova, Johann George, Natalija Gucevska, Mark Harman, Maria Lomeli, Simon Mark Lucas, Erik Meijer, Rubmary Rojas, Silvia Sapora

Federated Learning for User Privacy and Data Confidentiality Workshop At ICML - July 24, 2021

Federated Learning with Buffered Asynchronous Aggregation

John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Michael Rabbat, Mani Malek, Dzmitry Huba

TSE - June 29, 2021

Learning From Mistakes: Machine Learning Enhanced Human Expert Effort Estimates

Federica Sarro, Rebecca Moussa, Alessio Petrozziello, Mark Harman

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy