November 4, 2020
Chunqiang (CQ) Tang, Kenny Yu, Kaushik Veeraraghavan, Jonathan Kaldor, Scott Michelson, Thawan Kooburat, Aravind Anbudurai, Matthew Clark, Kabir Gogia, Long Cheng, Ben Christensen, Alex Gartrell, Maxim Khutornenko, Sachin Kulkarni, Marcin Pawlowski, Tuomas Pelkonen, Andre Rodrigues, Rounak Tibrewal, Vaishnavi Venkatesan, Peter Zhang
We present Twine, Facebook’s cluster management system, which has been running in production for the past decade. Twine has helped convert our infrastructure from a collection of siloed pools of customized machines dedicated to individual workloads into a large-scale shared infrastructure with fungible hardware.
This paper presents a different approach to cache development, successfully employed at Facebook, which extracts a core set of common requirements and functionality from otherwise disjoint caching systems. CacheLib is a general-purpose caching engine, designed based on experiences with a range of caching use cases at Facebook, that facilitates the easy development and maintenance of caches.
This paper introduces FlightTracker, a family of APIs and systems that now manage consistency for online access to Facebook’s graph. FlightTracker implicitly provides read-your-writes (RYW) consistency and can be explicitly used to provide alternative consistency guarantees for special use cases; it enables flexible communication patterns between caches, which we have found important as the number of datacenters increases; it extends the same consistency guarantees to cross-shard indexes and materialized views, allowing us to transparently optimize queries; and it provides a uniform primitive for clients to obtain desired consistency guarantees across a variety of data stores.
Consensus-based replicated systems are complex, monolithic, and difficult to upgrade once deployed. As a result, deployed systems do not benefit from innovative research, and new consensus protocols rarely reach production. We propose virtualizing consensus by virtualizing the shared log API, allowing services to change consensus protocols without downtime.
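To make the idea of a virtualized shared log concrete, here is a minimal in-memory sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `InMemoryLoglet`, `VirtualLog`, and `reconfigure` are hypothetical, and the "protocol" behind each segment is just a Python list standing in for a real consensus protocol. The point it shows is that clients keep calling the same `append` and `read` while the backing segment is swapped.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


class SealedError(Exception):
    """Raised when appending to a log segment that has been sealed."""


@dataclass
class InMemoryLoglet:
    """Hypothetical log segment; a list stands in for a real consensus protocol."""
    name: str
    entries: List[str] = field(default_factory=list)
    sealed: bool = False

    def append(self, entry: str) -> int:
        if self.sealed:
            raise SealedError(self.name)
        self.entries.append(entry)
        return len(self.entries) - 1  # position local to this segment

    def read(self, index: int) -> str:
        return self.entries[index]

    def seal(self) -> int:
        """Stop accepting appends and return the final segment length."""
        self.sealed = True
        return len(self.entries)


class VirtualLog:
    """Single log address space stitched together from a chain of segments."""

    def __init__(self, first: InMemoryLoglet) -> None:
        # Each element is (global start position, segment).
        self._chain: List[Tuple[int, InMemoryLoglet]] = [(0, first)]

    def append(self, entry: str) -> int:
        start, active = self._chain[-1]
        return start + active.append(entry)

    def read(self, position: int) -> str:
        # Find the newest segment whose range begins at or before the position.
        for start, loglet in reversed(self._chain):
            if position >= start:
                return loglet.read(position - start)
        raise IndexError(position)

    def reconfigure(self, new_loglet: InMemoryLoglet) -> None:
        """Seal the active segment and direct future appends to a new one."""
        start, active = self._chain[-1]
        length = active.seal()
        self._chain.append((start + length, new_loglet))
```

A caller that appends two entries, calls `reconfigure` with a fresh segment, and appends a third sees contiguous positions 0, 1, 2 and can still read all three. A real system would also need to agree on the chain itself in a fault-tolerant way, which this sketch omits.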
FastPay allows a set of distributed authorities, some of which are Byzantine, to maintain a high-integrity, high-availability settlement system for pre-funded payments. It can be used to settle payments in a native unit of value (crypto-currency), or as a financial side-infrastructure to support retail payments in fiat currencies. FastPay is based on Byzantine Consistent Broadcast as its core primitive, forgoing the expenses of full atomic commit channels (consensus).
The problem caused by simultaneous recharging of batteries in a data center has not been extensively studied and no real-world solutions have been proposed in the literature. In this paper, we identify the problem due to battery recharging with case studies from Facebook’s data centers. We describe the solutions we have developed to coordinate charging of batteries without exceeding the circuit breaker power limit.
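The constraint described above can be illustrated with a small sketch. This is not the paper's method: the function `plan_charging`, its parameters, and the proportional-scaling policy are all assumptions chosen for simplicity. It only shows the arithmetic of keeping server load plus total charging power at or below a breaker limit.

```python
from typing import Dict


def plan_charging(
    breaker_limit_w: float,
    it_load_w: float,
    demands_w: Dict[str, float],
) -> Dict[str, float]:
    """Assign charging power per battery without exceeding the breaker limit.

    Hypothetical policy: if the batteries' combined demand fits in the
    headroom left by the IT load, grant every request in full; otherwise
    scale all requests down by the same factor.
    """
    headroom = max(0.0, breaker_limit_w - it_load_w)
    total_demand = sum(demands_w.values())
    if total_demand <= headroom:
        return dict(demands_w)
    scale = headroom / total_demand  # total_demand > 0 on this branch
    return {name: demand * scale for name, demand in demands_w.items()}
```

For example, with a 1000 W limit, a 700 W IT load, and three batteries each requesting 200 W, the headroom is 300 W and each battery is granted 100 W. A production policy would likely also weigh factors such as battery state of charge, which this sketch ignores.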
Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers.
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources.
In this paper, we describe our journey to build and run an OLTP LSM-tree SQL database at scale. We also discuss the features we implemented to keep pace with UDB workloads, what made migrations easier, and what operational and software development challenges we faced during the two years of running MyRocks in production.
Using dependent types in production code is a practical way to eliminate errors. While there are many examples of using dependent Haskell to prove invariants about code, few of these are applied to large-scale production systems. Critics claim that dependent types are only useful in toy examples and that they are impractical for use in the real world. This experience report analyzes real-world examples where dependent types have enabled us to find and eliminate bugs in production Haskell code.