Large-scale Internet services run on fleets of distributed servers, and continuous hardware availability is key to the robustness of these services. Unplanned reboots disrupt the services running on the hardware and lower fleet availability. Server reboots are also important signals that can indicate underlying issues such as memory leaks in the services, catastrophic hardware failures, and network or power disruptions at the datacenters.
In this paper, we present an at-scale, near-real-time reboot monitoring framework built on multiple state-of-the-art data infrastructures, together with machine learning-based anomaly detection and automated root cause analysis across hundreds of server attribute combinations. We observed that 1% of the reboots in our hardware fleet were associated with kernel panics and out-of-memory events, and that these reboots exhibited strong locality both in time and across services.