November 2, 2016

OSDI ’16: Showcasing the Top Systems Research

By: Kelly Berschauer

Facebook researchers are sharing their latest systems research at OSDI ’16

Identifying and resolving resource utilization bottlenecks and improving data-quality tradeoffs for large-scale web services are the topics of two papers being presented by Facebook researchers at the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI) this week. Long known as the most influential conference in operating systems and distributed systems, OSDI brings together top systems practitioners, with a diverse set of perspectives from academia and industry, to discuss and collaborate on the most challenging systems problems.

“Attending OSDI is a great way to keep abreast of the top systems research—I always find out about new approaches to problems we see at Facebook, and also have the opportunity to educate the community on what we have learned,” said Kaushik Veeraraghavan, a software engineer at Facebook. “But most importantly, OSDI gives us the opportunity to maintain ties with our research colleagues. Many of our best research collaborations and top interns have come from discussions we’ve had at OSDI.”

Facebook papers being presented at OSDI include:

The paper Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services, by Facebook researchers and engineers Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, and Yee Jiun Song, describes a system in production at Facebook that leverages live user traffic to empirically load test every level of the infrastructure stack and measure its capacity.

Before Kraken, large-scale web services at Facebook were difficult to model accurately because they are composed of hundreds of rapidly evolving software systems, are distributed across geo-replicated data centers, and have constantly changing workloads. Each system must be allocated capacity, configured, and tuned to use data center resources efficiently, even as user behavior and software components evolve.

The team’s work was motivated by three key insights: (1) live user traffic accessing a web service provides the most current target workload possible, (2) the system can be empirically tested to identify its scalability limits, and (3) the user impact and operational overhead of empirical testing can be largely eliminated by building automation that adjusts live traffic based on feedback.

Based on these insights, the team designed the Kraken system, which has been in production at Facebook for three years, managing the traffic generated by 1.7 billion users, and improving hardware utilization by over 20%.
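To make the third insight concrete, here is a minimal sketch, in Python, of a feedback-driven load test. Everything in it is an illustrative assumption rather than Kraken’s actual implementation: the thresholds, step size, and simulated monitoring data are invented. The core idea from the paper survives, though: shift more live traffic onto a test target while health metrics stay within limits, and back off when they degrade.

```python
# Sketch of a Kraken-style feedback loop (hypothetical names and
# thresholds; not Facebook's implementation). The loop pushes live
# traffic toward a test target and backs off before users are hurt.

import random

LATENCY_LIMIT_MS = 200.0   # assumed: back off if p99 latency exceeds this
ERROR_RATE_LIMIT = 0.005   # assumed: back off if error rate exceeds 0.5%
STEP = 0.05                # traffic-weight adjustment per iteration


def read_health_metrics(weight):
    """Stand-in for a real monitoring system: returns (p99 latency in
    ms, error rate) that degrade as the traffic weight grows."""
    latency = 80 + 150 * weight ** 2 + random.uniform(-5, 5)
    errors = max(0.0, 0.008 * (weight - 0.6))
    return latency, errors


def run_load_test(iterations=20):
    weight = 0.1  # fraction of live traffic routed to the test target
    for i in range(iterations):
        latency, errors = read_health_metrics(weight)
        healthy = latency < LATENCY_LIMIT_MS and errors < ERROR_RATE_LIMIT
        if healthy:
            weight = min(1.0, weight + STEP)      # push more traffic
        else:
            weight = max(0.0, weight - 2 * STEP)  # back off quickly
        print(f"iter={i:2d} weight={weight:.2f} "
              f"p99={latency:.0f}ms err={errors:.4f} healthy={healthy}")
    return weight


if __name__ == "__main__":
    print(f"capacity estimate: ~{run_load_test():.0%} of routed traffic")
```

In this toy setting the loop settles near the weight where latency crosses the limit, which is exactly the empirically discovered capacity the paper is after.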

The paper DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services, being presented by Facebook engineer Michael Chow, is based on work done during his internship with his mentor Kaushik Veeraraghavan and with his University of Michigan colleagues Michael Cafarella and Jason Flinn. The paper demonstrates that the data-quality tradeoffs prevalent in Internet service pipelines are often suboptimal because they are reactive and fail to consider global information. DQBarge enables better tradeoffs by propagating data along the causal path of request processing and generating models of performance and quality for potential tradeoffs, which improves responses to load spikes, utilization of spare resources, and dynamic capacity planning.
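The following is a minimal sketch, again in Python with invented names, of the proactive tradeoff DQBarge enables. It assumes a hypothetical RequestContext that carries load metadata along the causal path and a toy performance/quality model; the paper builds its models from traced production requests rather than from a linear heuristic like this one.

```python
# Sketch of a DQBarge-style proactive data-quality tradeoff
# (hypothetical names; not the paper's API). Load metadata rides along
# with each request, and a simple model picks how many candidate items
# a stage should fetch: fewer under high load (lower quality, lower
# latency) instead of reacting to a local timeout.

from dataclasses import dataclass, field


@dataclass
class RequestContext:
    """Metadata propagated along the causal path of a request."""
    request_id: str
    system_load: float          # assumed scale: 0.0 (idle) .. 1.0 (saturated)
    trace: list = field(default_factory=list)


def predict_fetch_count(load, max_items=100, min_items=10):
    """Toy performance/quality model: scale work down linearly with
    load. DQBarge derives such models from traced requests."""
    items = int(max_items - (max_items - min_items) * load)
    return max(min_items, items)


def ranking_stage(ctx: RequestContext):
    # Proactive decision using propagated load, not a reactive timeout.
    n = predict_fetch_count(ctx.system_load)
    ctx.trace.append(f"ranking: fetched {n} candidates")
    return list(range(n))


if __name__ == "__main__":
    for load in (0.1, 0.9):
        ctx = RequestContext(request_id="r1", system_load=load)
        results = ranking_stage(ctx)
        print(f"load={load}: {len(results)} items; trace={ctx.trace}")
```

Because the decision consults globally propagated load rather than local symptoms, the same mechanism can also spend spare capacity on higher quality when the system is idle.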