Publication

Predicting Remediations for Hardware Failures in Large-Scale Datacenters

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)


Abstract

Large-scale service environments rely on autonomous systems for remediating hardware failures efficiently. In production, the autonomous system diagnoses hardware failures based on the rules that the subject matter experts put in the system. This process is increasingly complex given new types of failures and the increasing complexity in the hardware and software configurations.

In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures, based on the similar repair tickets closed in the past. We explain the methodology in detail for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.

Related Publications

All Publications

CSCW - November 25, 2019

Who Ties the World Together? Evidence from a Large Online Social Network

Guanghua Chi, Bogdan State, Joshua E. Blumenstock, Lada Adamic

CVPR - June 1, 2021

Semi-supervised Synthesis of High-Resolution Editable Textures for 3D Humans

Bindita Chaudhuri, Nikolaos Sarafianos, Linda Shapiro, Tony Tung

NeurIPS - December 6, 2020

High-Dimensional Contextual Policy Search with Unknown Context Rewards using Bayesian Optimization

Qing Feng, Benjamin Letham, Hongzi Mao, Eytan Bakshy

Innovative Technology at the Interface of Finance and Operations - March 31, 2021

Market Equilibrium Models in Large-Scale Internet Markets

Christian Kroer, Nicolas E. Stier-Moses

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy