Silent Data Corruptions at Scale



Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by the error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of engineering time to debug. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. Using a case study, we provide the debug flow followed to root-cause and triage faulty instructions within a CPU, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in the detection of hundreds of CPUs exhibiting these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

