June 11, 2016

Treadmill: Attributing the Source of Tail Latency through Precise Load Testing and Statistical Inference

International Symposium on Computer Architecture

By: Yunqi Zhang, David Meisner, Jason Mars, Lingjia Tang

Abstract

Managing tail latency of requests has become one of the primary challenges for large-scale Internet services. Data centers are quickly evolving and service operators frequently desire to make changes to deployed software and hardware features. Such changes demand a confident understanding of the impact on one’s service, in particular its effect on tail latency (e.g., 95th- or 99th-percentile response latency of the service). Evaluating the impact on the tail is challenging because of its inherent variability. Existing tools and methodologies for measuring these effects suffer from a number of deficiencies including poor load tester design, statistically inaccurate aggregation, and improper attribution of effects. As shown in the paper, these pitfalls can often result in misleading conclusions.

In this paper, we develop a methodology for statistically rigorous performance evaluation for server workloads. First, we find that the design of the server load tester is critical to ensuring quality results and empirically demonstrate the inaccuracy of load testers in previous work. Learning from these flaws, we design and develop a modular load tester platform, Treadmill, that overcomes many pitfalls of existing tools. Next, using Treadmill, we construct measurement and analysis procedures that can properly attribute performance factors. We build on prior research in statistically sound performance evaluation and quantile regression, extending it to accommodate the idiosyncrasies of server systems. Finally, we use our augmented methodology to evaluate the utility of common server hardware features with Facebook production workloads on production hardware. We decompose the effects of these features on request tail latency and demonstrate that our evaluation methodology provides superior results, particularly in capturing complicated and counter-intuitive performance behaviors. By tuning the hardware features as suggested by the attribution, we reduce the 99th-percentile latency by 43% and its variance by 93%.
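
To make the notion of statistically inaccurate aggregation concrete, the following is a minimal sketch of one way tail latency can be aggregated incorrectly: averaging per-client 99th percentiles instead of computing the percentile over the pooled samples. The abstract does not say which aggregation errors the paper examines, so this scenario is an assumption chosen for illustration. The helper `percentile`, the nearest-rank definition it uses, and the latency values are all hypothetical and are not taken from Treadmill.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that very small p still returns a sample.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds from two load-generating clients.
client_a = [1.0] * 100                 # every request is fast
client_b = [1.0] * 90 + [50.0] * 10    # ten percent of requests are slow

# Aggregation over the pooled samples: the p99 of all 200 requests.
pooled_p99 = percentile(client_a + client_b, 99)

# Aggregation by averaging each client's own p99.
averaged_p99 = (percentile(client_a, 99) + percentile(client_b, 99)) / 2

print(f"pooled p99:   {pooled_p99} ms")
print(f"averaged p99: {averaged_p99} ms")
```

In this made-up data set the two aggregates disagree: the pooled value reflects the slow requests that make up the tail, while the averaged value is pulled toward the fast client. Percentiles are not linear, so per-client percentiles generally cannot be averaged to recover the percentile of the combined distribution.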