Publication

Predicting Remediations for Hardware Failures in Large-Scale Datacenters

IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)


Abstract

Large-scale service environments rely on autonomous systems to remediate hardware failures efficiently. In production, such a system diagnoses hardware failures using rules written by subject matter experts. Maintaining these rules grows harder as new types of failures emerge and as hardware and software configurations become more complex.

In this paper, we present a machine learning framework that predicts the required remediations for undiagnosed failures based on similar repair tickets closed in the past. We explain in detail the methodology for setting up a machine learning model, deploying it in a production environment, and monitoring its performance with the necessary metrics. We also demonstrate the prediction performance on some of the repair actions.
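
To make the idea of predicting a remediation from similar closed tickets concrete, the following is a minimal sketch and not the framework described in the paper. It assumes each ticket can be reduced to a set of failure signals and uses a similarity-weighted nearest-neighbor vote with Jaccard similarity. The ticket data, signal names, remediation labels, and the function `predict_remediation` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical closed repair tickets: (observed failure signals, remediation applied).
# Signal names and remediation labels are invented for this sketch.
CLOSED_TICKETS = [
    ({"dimm_ecc_error", "memory_test_fail"}, "replace_dimm"),
    ({"dimm_ecc_error", "kernel_panic"}, "replace_dimm"),
    ({"disk_smart_warning", "io_timeout"}, "replace_disk"),
    ({"disk_smart_warning", "fs_readonly"}, "replace_disk"),
    ({"nic_link_flap", "io_timeout"}, "reseat_cable"),
]


def jaccard(a, b):
    """Jaccard similarity between two signal sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_remediation(signals, history=CLOSED_TICKETS, k=3):
    """Vote among the k most similar closed tickets, weighted by similarity.

    Returns None when no past ticket shares any signal with the new failure,
    so that the case can be left to a human or to the rule-based path.
    """
    scored = sorted(
        ((jaccard(signals, past), action) for past, action in history),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for score, action in scored[:k]:
        if score > 0:
            votes[action] += score
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict_remediation({"dimm_ecc_error"}))
    print(predict_remediation({"fan_failure"}))
```

Returning `None` for failures that resemble no past ticket is a design choice of this sketch: it keeps the predictor from guessing when the history offers no evidence.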

