132 Results

October 31, 2018

A Large Scale Study of Data Center Network Reliability

Internet Measurement Conference (IMC)

This paper presents a large-scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world.

By: Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, Onur Mutlu

October 12, 2018

Flight Control System Design for a High Altitude, Long Endurance Airplane: Sensor Distribution and Flexible Modes Control

This manuscript outlines the control system design process for a solar-powered unmanned high-altitude long endurance flying wing aircraft, called Aquila, which was developed by the Facebook Connectivity Lab to serve as communication backhaul for remote and rural connectivity.

By: Hamidreza Bolandhemmat

October 12, 2018

Energy-Optimized Trajectory Planning for High Altitude Long Endurance (HALE) Aircraft

This paper outlines the energy-optimized trajectory planning problem for high altitude, long endurance (HALE) aircraft and explores both offline and online optimization techniques to address it.

By: Hamidreza Bolandhemmat, Jack Marriott, Benjamin Thomsen

October 9, 2018

Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently

USENIX Symposium on Operating Systems Design and Implementation (OSDI)

We present Maelstrom, a new system for mitigating and recovering from datacenter-level disasters. Maelstrom provides a traffic management framework with modular, reusable primitives that can be composed to safely and efficiently drain the traffic of interdependent services from one or more failing datacenters to the healthy ones.

By: Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Ashish Shah, Yee Jiun Song, Tianyin Xu

October 8, 2018

Sharding the Shards: Managing Datastore Locality at Scale with Akkio

USENIX Symposium on Operating Systems Design and Implementation (OSDI)

Akkio is a locality management service layered between client applications and distributed datastore systems. It determines how and when to migrate data to reduce response times and resource usage. Akkio primarily targets multi-datacenter geo-distributed datastore systems.

By: Muthukaruppan Annamalai, Kaushik Ravichandran, Harish Srinivas, Igor Zinkovsky, Luning Pan, Tony Savor, David Nagle, Michael Stumm

September 23, 2018

From Start-ups to Scale-ups: Opportunities and Open Problems for Static and Dynamic Program Analysis

IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM)

This paper describes some of the challenges and opportunities when deploying static and dynamic analysis at scale, drawing on the authors’ experience with the Infer and Sapienz Technologies at Facebook, each of which started life as a research-led start-up that was subsequently deployed at scale, impacting billions of people worldwide.

By: Mark Harman, Peter O'Hearn

August 27, 2018

Providing Streaming Joins as a Service at Facebook

International Conference on Very Large Data Bases (VLDB)

This paper describes an end-to-end streaming join service built around a streaming join operator. The operator uses an adaptive stream synchronization algorithm that can handle the different event-time distributions we observe in real-world streams.

By: Gabriela Jacques da Silva, Ran Lei, Luwei Cheng, Guoqiang Jerry Chen, Kuen Ching, Tanji Hu, Yuan Mei, Kevin Wilfong, Rithin Shetty, Serhat Yilmaz, Anirban Banerjee, Benjamin Heintz, Shridhar Iyer, Anshul Jaiswal

August 26, 2018

Rosetta: Large Scale System for Text Detection and Recognition in Images

Knowledge Discovery in Databases (KDD)

In this paper we present a deployed, scalable optical character recognition (OCR) system, which we call Rosetta, designed to process images uploaded daily at Facebook scale.

By: Fedor Borisyuk, Albert Gordo, Viswanath Sivakumar

August 22, 2018

FBOSS: Building Switch Software at Scale

We present FBOSS, our own data center switch software, which is designed around our switch-as-a-server and deploy-early-and-iterate principles.

By: Sean Choi, Boris Burkov, Alex Eckert, Tian Fang, Saman Kazemkhani, Rob Sherwood, Ying Zhang, James Hongyi Zeng

August 19, 2018

A real-time framework for detecting efficiency regressions in a globally distributed codebase

Knowledge Discovery in Databases (KDD)

This paper describes the end-to-end regression detection system designed and used at Facebook. The main detection algorithm is based on sequential statistics supplemented by signal processing transformations, and the performance of the algorithm was assessed with a mixture of online and offline tests across different use cases.

By: Martin Valdez-Vivas, Caner Gocmen, Andrii Korotkov, Ethan Fang, Kapil Goenka, Sherry Chen