This paper fills the gap by presenting a large-scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world.
This manuscript outlines the control system design process for a solar-powered unmanned high-altitude long endurance flying wing aircraft, called Aquila, which was developed by the Facebook Connectivity Lab to serve as communication backhaul for remote and rural connectivity.
Symposium on Operating Systems Design and Implementation (OSDI)
We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones.
By: Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Ashish Shah, Yee Jiun Song, Tianyin Xu
USENIX Symposium on Operating Systems Design and Implementation (OSDI)
Akkio is a locality management service layered between client applications and distributed datastore systems. It determines how and when to migrate data to reduce response times and resource usage. Akkio primarily targets multi-datacenter geo-distributed datastore systems.
By: Muthukaruppan Annamalai, Kaushik Ravichandran, Harish Srinivas, Igor Zinkovsky, Luning Pan, Tony Savor, David Nagle, Michael Stumm
IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM)
This paper describes some of the challenges and opportunities when deploying static and dynamic analysis at scale, drawing on the authors’ experience with the Infer and Sapienz Technologies at Facebook, each of which started life as a research-led start-up that was subsequently deployed at scale, impacting billions of people worldwide.
International Conference on Very Large Data Bases (VLDB)
This paper describes an end-to-end streaming join service that addresses the challenges above through a streaming join operator that uses an adaptive stream synchronization algorithm that is able to handle the different distributions we observe in real-world streams regarding their event times.
By: Gabriela Jacques da Silva, Ran Lei, Luwei Cheng, Guoqiang Jerry Chen, Kuen Ching, Tanji Hu, Yuan Mei, Kevin Wilfong, Rithin Shetty, Serhat Yilmaz, Anirban Banerjee, Benjamin Heintz, Shridhar Iyer, Anshul Jaiswal
This paper describes the end-to-end regression detection system designed and used at Facebook. The main detection algorithm is based on sequential statistics supplemented by signal processing transformations, and the performance of the algorithm was assessed with a mixture of online and offline tests across different use cases.