Publication

Passive Realtime Datacenter Fault Detection and Localization

USENIX Symposium on Networked Systems Design and Implementation (NSDI)


Abstract

Datacenters are characterized by their large scale, comprising a large number of network links and switches. However, these hardware components can develop intermittent faults, resulting in randomly occurring packet drops or delays that harm application performance—several such faults occur daily in large production datacenters. Since the effects are intermittent, traditional detection techniques involving end-host and router statistics or active probe traffic can fall short in their ability to identify and locate these errors. In this article, we present our passive hybrid approach that combines network path information with end-host-based statistics to rapidly detect and pinpoint the location of datacenter network faults inside a production Facebook datacenter.

Related Publications

ACM SIGCOMM - August 23, 2021

Network Planning with Deep Reinforcement Learning

Hang Zhu, Varun Gupta, Satyajeet Singh Ahuja, Yuandong Tian, Ying Zhang, Xin Jin

ACM SIGCOMM - July 30, 2021

ARROW: Restoration-Aware Traffic Engineering

Zhizhen Zhong, Manya Ghobadi, Alaa Khaddaj, Jonathan Leach, Yiting Xia, Ying Zhang

ACM SIGCOMM - August 23, 2021

Capacity-Efficient and Uncertainty-Resilient Backbone Network Planning with Hose

Satyajeet Singh Ahuja, Varun Gupta, Vinayak Dangui, Soshant Bali, Abishek Gopalan, Hao Zhong, Petr Lapukhov, Yiting Xia, Ying Zhang

Microwave Journal - June 16, 2021

Combining CLOS and NLOS Microwave Backhaul to Help Solve the Rural Connectivity Challenge

Erik Boch, Julius Kusuma
