Passive Realtime Datacenter Fault Detection

USENIX Symposium on Networked Systems Design and Implementation (NSDI) 2017


Abstract

Datacenters are characterized by their large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with small but non-zero failure rates mean that datacenters are subject to significant numbers of failures, impacting the performance of the services that rely on them. To make matters worse, these failures are not always obvious; network switches and links can fail partially, dropping or delaying various subsets of packets without necessarily delivering a clear signal that they are faulty. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors.

We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate transport-layer flow metrics and network-I/O system call delay at end hosts with the path that traffic takes through the datacenter and apply statistical analysis techniques to identify outliers and localize the faulty link and/or switch(es). We evaluate our approach in a production Facebook front-end datacenter.
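
As a rough illustration of the idea, the sketch below aggregates per-flow transport metrics by the link each flow traversed and flags links whose retransmit rate is an outlier among their peers. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not name the statistical test, so a modified z-score based on the median absolute deviation (MAD) is swapped in here for simplicity. The flow record fields (`path`, `packets`, `retransmits`), the link names, the function names, and the threshold of 3.5 are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def link_retransmit_rates(flows):
    """Aggregate per-flow counters into one retransmit rate per link."""
    packets = defaultdict(int)
    retransmits = defaultdict(int)
    for flow in flows:
        for link in flow["path"]:
            packets[link] += flow["packets"]
            retransmits[link] += flow["retransmits"]
    return {link: retransmits[link] / packets[link]
            for link in packets if packets[link] > 0}


def outlier_links(rates, threshold=3.5):
    """Return links whose rate is unusually high relative to the other links.

    Uses a modified z-score, 0.6745 * (x - median) / MAD. The test and the
    threshold are illustrative assumptions, not taken from the paper.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Degenerate case: at least half the links share the median rate,
        # so any link above the median is treated as suspicious.
        return [link for link, v in rates.items() if v > med]
    return [link for link, v in rates.items()
            if 0.6745 * (v - med) / mad > threshold]


# Hypothetical data: 40 flows spread evenly over four equivalent core links.
# Only the core-tier hop is recorded, so the compared links carry comparable
# traffic. Flows crossing "core-2" see far more retransmissions.
flows = [
    {
        "path": [f"core-{i % 4}"],
        "packets": 1000,
        "retransmits": 50 if i % 4 == 2 else i % 3,
    }
    for i in range(40)
]

rates = link_retransmit_rates(flows)
suspects = outlier_links(rates)
print(suspects)
```

With this synthetic data, only `core-2` is flagged. The sketch compares links within a single tier on purpose: a link that merely shares flows with a faulty one (for example, a rack uplink feeding several core links) would see a diluted but elevated rate, so a real system would need to compare like with like and account for how traffic is spread across paths.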

