Publication

Twine: A Unified Cluster Management System for Shared Infrastructure

USENIX Symposium on Operating Systems Design and Implementation (OSDI)


Abstract

We present Twine, Facebook’s cluster management system which has been running in production for the past decade. Twine has helped convert our infrastructure from a collection of siloed pools of customized machines dedicated to individual workloads, into a large-scale shared infrastructure with fungible hardware.

Our goal of ubiquitous shared infrastructure leads us to some decisions counter to common practices. For instance, rather than deploying an isolated control plane per cluster, Twine scales a single control plane to manage one million machines across all data centers in a geographic region and transparently move jobs across clusters.

Twine accommodates workload-specific customization in shared infrastructure, and this approach further departs from common practices. The TaskControl API allows an application to collaborate with Twine to handle container lifecycle events, e.g., restarting a ZooKeeper deployment’s followers first and its leader last during a rolling upgrade. Host profiles capture hardware and OS settings that workloads can tune to improve performance and reliability; Twine dynamically allocates machines to workloads and switches host profiles accordingly.

Finally, going against the conventional wisdom of prioritizing stacking workloads on big machines to increase utilization, we universally deploy power-efficient small machines outfit with a single CPU and 64GB RAM to achieve higher performance per watt, and we leverage autoscaling to improve machine utilization.

We describe the design of Twine and share our experience in migrating Facebook’s workloads onto shared infrastructure.

Related Publications

All Publications

IEEE TSE - February 17, 2021

Machine Learning Testing: Survey, Landscapes and Horizons

Jie M. Zhang, Mark Harman, Lei Ma, Yang Liu

IMC - October 21, 2019

Internet Performance from Facebook’s Edge

Brandon Schlinker, Italo Cunha, Yi-Ching Chiu, Srikanth Sundaresan, Ethan Katz-Bassett

CC - March 3, 2021

Lightning BOLT: Powerful, Fast, and Scalable Binary Optimization

Maksim Panchenko, Rafael Auler, Laith Sakka, Guilherme Ottoni

USENIX FAST - February 23, 2021

Facebook’s Tectonic Filesystem: Efficiency from Exascale

Satadru Pan, Theano Stavrinos, Yunqiao Zhang, Atul Sikaria, Pavel Zakharov, Abhinav Sharma, Shiva Shankar, Mike Shuey, Richard Wareing, Monika Gangapuram, Guanglei Cao, Christian Preseau, Pratap Singh, Kestutis Patiejunas, JR Tipton, Ethan Katz-Bassett, Wyatt Lloyd

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy