Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment



Root cause analysis in a large-scale production environment is challenging due to the complexity of services running across global data centers. Due to the distributed nature of a large-scale system, the various hardware, software, and tooling logs are often maintained separately, making it difficult to review the logs jointly for understanding production issues. Another challenge in reviewing the logs for identifying issues is the scale – there could easily be millions of entities, each described by hundreds of features. In this paper we present a fast dimensional analysis framework that automates the root cause analysis on structured logs with improved scalability.

We first explore item-sets, i.e. combinations of feature values, that could identify groups of samples with sufficient support for the target failures using the Apriori algorithm and a subsequent improvement, FP-Growth. These algorithms were designed for frequent item-set mining and association rule learning over transactional databases. After applying them on structured logs, we select the item-sets that are most unique to the target failures based on lift. We propose pre-processing steps with the use of a large-scale real-time database and post-processing techniques and parallelism to further speed up the analysis and improve interpretability, and demonstrate that such optimization is necessary for handling large-scale production datasets. We have successfully rolled out this approach for root cause investigation purposes in a large-scale infrastructure. We also present the setup and results from multiple production use cases in this paper.

Related Publications

All Publications

MLSys - March 1, 2020

Predictive Precompute with Recurrent Neural Networks

Hanson Wang, Zehui Wang, Yuanyuan Ma

CODE - November 20, 2020

Privacy-Preserving Randomized Controlled Trials: A Protocol for Industry Scale Deployment (Extended Abstract)

Mahnush Movahedi, Benjamin M. Case, Andrew Knox, Li Li, Yiming Paul Li, Sanjay Saravanan, Shubho Sengupta, Erik Taubeneck

ACM SIGCOMM - October 26, 2020

Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, Theophilus A. Benson

FL-ICML - September 1, 2020

ResiliNet: Failure-Resilient Inference in Distributed Neural Networks

Ashkan Yousefpour, Brian Q. Nguyen, Siddartha Devic, Guanhua Wang, Aboudy Kreidieh, Hans Lobel, Alexandre M. Bayen, Jason P. Jue

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy