Passive Realtime Datacenter Fault Detection

USENIX Symposium on Networked Systems Design and Implementation (NSDI) 2017


Datacenters are characterized by their large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with small but non-zero failure rates mean that datacenters are subject to significant numbers of failures, impacting the performance of the services that rely on them. To make matters worse, these failures are not always obvious; network switches and links can fail partially, dropping or delaying various subsets of packets without necessarily delivering a clear signal that they are faulty. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors.

We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate transport-layer flow metrics and network-I/O system call delay at end hosts with the path that traffic takes through the datacenter and apply statistical analysis techniques to identify outliers and localize the faulty link and/or switch(es). We evaluate our approach in a production Facebook front-end datacenter.
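As a rough illustration of the idea of correlating per-flow metrics with paths and then looking for outliers, here is a minimal sketch in Python. It is not the method evaluated in the paper. It assumes each flow record carries the list of links it traversed together with packet and retransmit counts, and it swaps in a simple median/MAD (modified z-score) test as the statistical technique. The function names, the record layout, the link identifiers, and the 3.5 threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def link_retransmit_rates(flows):
    """Aggregate per-flow counters onto every link of the flow's path.

    flows: iterable of (path, packets, retransmits), where path is a list
    of link identifiers (hypothetical record layout).
    """
    packets = defaultdict(int)
    retransmits = defaultdict(int)
    for path, pkts, rtx in flows:
        for link in path:
            packets[link] += pkts
            retransmits[link] += rtx
    return {l: retransmits[l] / packets[l] for l in packets if packets[l] > 0}


def outlier_links(rates, candidates, threshold=3.5):
    """Flag candidates whose rate is far above that of their peers.

    candidates should be links expected to behave alike (assumption: they
    are equivalent alternatives under load balancing). Uses the modified
    z-score 0.6745 * (x - median) / MAD; the threshold is illustrative.
    """
    values = [rates[l] for l in candidates]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to compare against; a real system needs a fallback
    return sorted(l for l in candidates
                  if 0.6745 * (rates[l] - med) / mad > threshold)


# Made-up flow records: each flow crosses one edge link and one core link.
example_flows = [
    (["e1", "c1"], 1000, 2), (["e1", "c2"], 1000, 1),
    (["e2", "c3"], 1000, 3), (["e2", "c4"], 1000, 60),
    (["e3", "c1"], 1000, 1), (["e3", "c4"], 1000, 55),
    (["e4", "c2"], 1000, 2), (["e4", "c3"], 1000, 1),
]

rates = link_retransmit_rates(example_flows)
suspect_core = outlier_links(rates, ["c1", "c2", "c3", "c4"])
suspect_edge = outlier_links(rates, ["e1", "e2", "e3", "e4"])
```

With this made-up data, only the core link `c4` is flagged. The edge links `e2` and `e3` show elevated rates because some of their flows cross `c4`, but they do not stand out among their own peers, which is why the sketch compares links only within a group assumed to be equivalent.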
