A Large Scale Study of Data Center Network Reliability

Internet Measurement Conference (IMC)


The ability to tolerate, remediate, and recover from network incidents (e.g., those caused by device failures and fiber cuts) is critical for building and operating highly available web services. Achieving fault tolerance and failure preparedness requires system architects, software developers, and site operators to have a deep understanding of network reliability at scale, along with its implications for data center systems. Unfortunately, little has been reported on the reliability characteristics of large-scale data center network infrastructure, let alone its impact on the availability of services powered by software running on that network infrastructure (service-level availability).

This paper fills the gap by presenting a large-scale, longitudinal study of data center network reliability based on operational data collected from the production network infrastructure at Facebook, one of the largest web service providers in the world. The study covers reliability characteristics of both intra- and inter-data-center networks. For intra-data-center networks, we study seven years of operational data comprising thousands of network incidents across two different data center network designs: a classic cluster-based architecture and a state-of-the-art fabric-based topology. For inter-data-center networks, we study eighteen months of recent repair tickets from the field to understand the reliability of WAN backbones. In contrast to prior work, we study the effects of network reliability on web services and how these reliability characteristics evolve over time. We discuss the implications of network reliability for the design, implementation, and operation of large-scale data center systems and how it affects highly available web services. We hope our study forms a foundation for understanding the reliability of large-scale network infrastructure and inspires new reliability solutions to network incidents.

