About Core Systems

Facebook Core Systems researchers and engineers design and build the distributed systems that power Facebook’s infrastructure. Our work spans the engineering spectrum of research, development, deployment, and production, ensuring that our systems run efficiently, reliably, and securely across millions of machines in tens of geo-replicated data center regions.

Core Systems performs forward-looking research in the area of distributed systems and architecture at a global scale. Billions of people rely on the services we build and manage to connect and communicate. Throughout the lifecycle of these distributed services, we encounter fundamental research challenges in multiple areas, including capacity management, configuration management, cluster management, deployment, distributed tracing, efficiency, fault tolerance, monitoring, performance, power management, reliability, routing, scalability, service discovery, and storage systems.

We build a strong collaboration pipeline with key experts in academia through Distributed Systems PhD fellowships, requests for proposals, faculty summits, internships, and visiting researcher programs.

In recent years, we’ve published work on cluster management (Twine, OSDI 2020), configuration management (Configerator, SOSP 2015), fault tolerance (Kraken, OSDI 2016; Maelstrom, OSDI 2018; Taiji, SOSP 2019), tracing (Canopy, SOSP 2017), data center power management (Dynamo, ISCA 2016), and consensus protocols (Delos, OSDI 2020). View our Publications for a list of all our published research.

Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge

David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, Dmitri Perelman

SOSP - October 29, 2019

A Large Scale Study of Data Center Network Reliability

Justin Meza, Tianyin Xu, Kaushik Veeraraghavan, Onur Mutlu

IMC - October 31, 2018

Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently

Kaushik Veeraraghavan, Justin Meza, Scott Michelson, Sankaralingam Panneerselvam, Alex Gyori, David Chou, Sonia Margulis, Daniel Obenshain, Ashish Shah, Yee Jiun Song, Tianyin Xu

OSDI - October 9, 2018

Canopy: An End-to-End Performance Tracing and Analysis System

Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O’Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song

SOSP - October 28, 2017

From academia to industry: How Facebook Engineer Jason Flinn started his journey in Core Systems

October 22, 2020

Asynchronous computing @Facebook: Driving efficiency and developer productivity at Facebook scale

August 17, 2020

Announcing the winners of the Distributed Systems research awards

February 28, 2020

Efficient, reliable cluster management at scale with Twine

June 6, 2019

