Predicting Remediations for Hardware Failures in Large-Scale Datacenters

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)


Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations.

In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.

Related Publications

All Publications

ICML - July 18, 2021

Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization

David Eriksson, Pierce I-Jen Chuang, Samuel Daulton, Peng Xia, Akshat Shrivastava, Arun Babu, Shicong Zhao, Ahmed Aly, Ganesh Venkatesh, Maximilian Balandat

ISAAC - December 5, 2021

On the Extended TSP Problem

Julián Mestre, Sergey Pupyrev, Seeun William Umboh

ICML - July 18, 2021

Variational Auto-Regressive Gaussian Processes for Continual Learning

Sanyam Kapoor, Theofanis Karaletsos, Thang D. Bui

AKBC - October 3, 2021

Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations

Yihong Chen, Pasquale Minervini, Sebastian Riedel, Pontus Stenetorp

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy