Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference

International Symposium on Computer Architecture


Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to deployed software and hardware features. Such changes demand a confident understanding of the impact on one’s service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions.
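The claim that tail metrics are inherently variable can be illustrated with a small simulation. The sketch below is not Treadmill and is not taken from the paper: the lognormal latency model, the request count, and the number of repeated runs are all assumptions chosen for illustration. It repeats the same synthetic experiment many times and compares how much the estimated median and the estimated 99th percentile move from run to run.

```python
import math
import random
import statistics


def percentile(samples, p):
    """Nearest-rank percentile of `samples` for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def run_experiment(rng, n_requests=1000):
    # Hypothetical latency model (an assumption, not from the paper):
    # lognormally distributed response times in arbitrary units.
    return [rng.lognormvariate(0.0, 1.0) for _ in range(n_requests)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
p50_estimates, p99_estimates = [], []
for _ in range(200):  # 200 repetitions of the identical experiment
    latencies = run_experiment(rng)
    p50_estimates.append(percentile(latencies, 50))
    p99_estimates.append(percentile(latencies, 99))

# Run-to-run spread of each estimate across the repetitions.
spread_p50 = statistics.stdev(p50_estimates)
spread_p99 = statistics.stdev(p99_estimates)
print(f"run-to-run stdev of median estimate: {spread_p50:.3f}")
print(f"run-to-run stdev of p99 estimate:    {spread_p99:.3f}")
```

Under these assumptions the 99th-percentile estimate varies far more between runs than the median does, which is why a single run is a weak basis for judging whether a change helped or hurt the tail.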

In this paper, we develop a methodology for statistically rigorous performance evaluation for server workloads. First, we find that the design of the server load tester is critical to ensuring quality results and empirically demonstrate the inaccuracy of load testers in previous work. Learning from these flaws, we design and develop a modular load tester platform, Treadmill, that overcomes many pitfalls of existing tools. Next, using Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We build on prior research in statistically sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the utility of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
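To make the idea of attributing a tail-latency effect through quantile regression more concrete, here is a minimal sketch of the simplest possible case: one binary factor (a hypothetical hardware feature that is either off or on). The abstract does not specify the paper's actual model, so every name below (`pinball_loss`, `fit_quantile`, `attribute_binary_factor`) and the synthetic data are assumptions. With a single binary factor the regression separates into two independent problems, one per group, so the fitted effect is simply the difference between the two per-group quantiles.

```python
import random


def pinball_loss(samples, prediction, tau):
    """Mean quantile (pinball) loss of a constant prediction at level tau."""
    total = 0.0
    for value in samples:
        error = value - prediction
        # Under-predictions are weighted by tau, over-predictions by 1 - tau.
        total += tau * error if error >= 0 else (tau - 1) * error
    return total / len(samples)


def fit_quantile(samples, tau):
    """Constant that minimizes the pinball loss; a minimizer always lies on a sample."""
    return min(samples, key=lambda candidate: pinball_loss(samples, candidate, tau))


def attribute_binary_factor(latency_off, latency_on, tau):
    """Fit latency_quantile = intercept + effect * feature_on for one binary factor."""
    intercept = fit_quantile(latency_off, tau)
    effect = fit_quantile(latency_on, tau) - intercept
    return intercept, effect


# Synthetic data (assumption): enabling the feature stretches the latency tail.
rng = random.Random(7)
feature_off = [rng.lognormvariate(0.0, 0.5) for _ in range(2000)]
feature_on = [rng.lognormvariate(0.0, 0.8) for _ in range(2000)]

intercept, effect = attribute_binary_factor(feature_off, feature_on, tau=0.99)
print(f"p99 with feature off:            {intercept:.3f}")
print(f"estimated p99 effect of feature: {effect:+.3f}")
```

A realistic attribution study would involve several factors and their interactions, which calls for a proper quantile regression solver rather than this per-group shortcut; the sketch only shows what the fitted coefficients mean.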

