The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services

Operating Systems Design and Implementation


Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers. The Achilles’ heel of current methods is the need for a complete and accurate model of the system under observation: producing such a model is challenging because it requires either assimilating the collective knowledge of hundreds of programmers responsible for the individual components or restricting the ways in which components interact.

Fortunately, the scale of modern Internet services offers a compensating benefit: the sheer volume of requests serviced means that, even at low sampling rates, one can gather a tremendous amount of empirical performance observations and apply “big data” techniques to analyze those observations. In this paper, we show how one can automatically construct a model of request execution from pre-existing component logs by generating a large number of potential hypotheses about program behavior and rejecting hypotheses contradicted by the empirical observations. We also show how one can validate potential performance improvements without costly implementation effort by leveraging the variation in component behavior that arises naturally over large numbers of requests to measure the impact of optimizing individual components or changing scheduling behavior.
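
The hypothesize-and-reject idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that each sampled request yields a trace mapping segment names to (start, end) timestamps, and it considers only one kind of hypothesis ("segment A always finishes before segment B starts"). The segment names, the trace format, and the function `infer_happens_before` are all hypothetical.

```python
from itertools import permutations


def infer_happens_before(traces):
    """Return the ordered pairs (a, b) never contradicted by any trace.

    traces: list of dicts mapping a segment name to a (start, end) tuple.
    The trace format is an assumption made for this sketch.
    """
    segments = set().union(*(t.keys() for t in traces))
    # Hypothesize every ordered pair: "a finishes before b starts".
    hypotheses = set(permutations(sorted(segments), 2))
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                a_end = trace[a][1]
                b_start = trace[b][0]
                # A single counterexample is enough to reject the hypothesis.
                if b_start < a_end:
                    hypotheses.discard((a, b))
    return hypotheses


# Two hypothetical request traces (times in milliseconds).
traces = [
    {"server": (0, 10), "network": (10, 25), "render": (25, 40)},
    {"server": (0, 12), "network": (8, 30), "render": (30, 41)},
]
surviving = infer_happens_before(traces)
```

In this toy input the second trace shows "network" starting before "server" ends, so the hypothesis that "server" precedes "network" is rejected, while the orderings before "render" survive. With only a handful of traces many surviving hypotheses would be coincidental; the approach relies on a large number of observations to make spurious relationships unlikely to survive.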

We validate our methodology by analyzing performance traces of over 1.3 million requests to Facebook servers. We present a detailed study of the factors that affect the end-to-end latency of such requests. We also use our methodology to suggest and validate a scheduling optimization for improving Facebook request latency.

Related Publications

HPCA - March 3, 2021

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads

Hyoukjun Kwon, Liangzhen Lai, Michael Pellauer, Tushar Krishna, Yu-Hsin Chen, Vikas Chandra

MLSys - April 8, 2021

CPR: Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

TSE - January 1, 2020

Approximate Oracles and Synergy in Software Energy Search Spaces

Bobby R. Bruce, Justyna Petke, Mark Harman, Earl T. Barr

OOPSLA - October 25, 2019

Getafix: Learning to Fix Bugs Automatically

Johannes Bader, Andrew Scott, Michael Pradel, Satish Chandra
