Research Area
Year Published

154 Results

April 20, 2020

Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers

International Conference on Performance Engineering (ICPE)

In this paper, we describe common intermittent hardware errors observed on server systems in a large-scale data center environment. We discuss two methodologies of handling interrupts in server systems – System Management Interrupt (SMI) and Corrected Machine Check Interrupt (CMCI).

By: Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar

March 16, 2020

Accelerometer: Understanding Acceleration Opportunities for Data Center Overheads at Hyperscale

Architectural Support for Programming Languages and Operating Systems (ASPLOS)

At global user population scale, important microservices in warehouse-scale data centers can grow to account for an enormous installed base of servers. With the end of Dennard scaling, successive server generations running these microservices exhibit diminishing performance returns. Hence, it is imperative to understand how important microservices spend their CPU cycles to determine acceleration opportunities across the global server feet. To this end, we first undertake a comprehensive characterization of the top seven microservices that run on the compute-optimized data center feet at Facebook.

By: Akshitha Sriraman, Abhishek Dhanotia

March 2, 2020

Federated Optimization in Heterogenous Networks

Conference on Machine Learning and Systems (MLSys)

Federated Learning is a distributed learning paradigm with two key challenges that differentiate it from traditional distributed optimization: (1) significant variability in terms of the systems characteristics on each device in the network (systems heterogeneity), and (2) non-identically distributed data across the network (statistical heterogeneity). In this work, we introduce a framework, FedProx, to tackle heterogeneity in federated networks.

By: Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, Virginia Smith

February 24, 2020

Characterizing, Modeling, and Benchmarking RocksDB Key-Value Workloads at Facebook

USENIX Conference on File and Storage Technologies (FAST)

In this paper, we first present a detailed characterization of workloads from three typical RocksDB production use cases at Facebook: UDB (a MySQL storage layer for social graph data), ZippyDB (a distributed key-value store), and UP2X (a distributed key-value store for AI/ML services).

By: Zhichao Cao, Siying Dong, Sagar Vemuri, David H.C. Du

February 17, 2020

The Architectural Implications of Facebook’s DNN-based Personalized Recommendation

International Symposium on High Performance Computer Architecture (HPCA)

The widespread application of deep learning has changed the landscape of computation in data centers. In particular, personalized recommendation for content ranking is now largely accomplished using deep neural networks. However, despite their importance and the amount of compute cycles they consume, relatively little research attention has been devoted to recommendation systems. To facilitate research and advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation.

By: Udit Gupta, Carole-Jean Wu, Xiaodong Wang, Maxim Naumov, Brandon Reagen, David Brooks, Bradford Cottel, Kim Hazelwood, Mark Hempstead, Bill Jia, Hsien-Hsin S. Lee, Andrey Malevich, Dheevatsa Mudigere, Mikhail Smelyanskiy, Liang Xiong, Xuan Zhang

January 23, 2020

Incorrectness Logic

Symposium on Principles of Programming Languages (POPL)

Program correctness and incorrectness are two sides of the same coin. As a programmer, even if you would like to have correctness, you might find yourself spending most of your time reasoning about incorrectness. This includes informal reasoning that people do while looking at or thinking about their code, as well as that supported by automated testing and static analysis tools. This paper describes a simple logic for program incorrectness which is, in a sense, the other side of the coin to Hoare’s logic of correctness.

By: Peter O'Hearn

December 2, 2019

PyTorch: An Imperative Style, High-Performance Deep Learning Library

Neural Information Processing Systems (NeurIPS)

In this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

By: Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, Soumith Chintala

November 30, 2019

MVFST-RL: An Asynchronous RL Framework for Congestion Control with Delayed Actions

Workshop on ML for Systems at NeurIPS

We present MVFST-RL, a scalable framework for congestion control in the QUIC transport protocol that leverages state-of-the-art in asynchronous RL training with off-policy correction. We analyze modeling improvements to mitigate the deviation from Markovian dynamics, and evaluate our method on emulated networks from the Pantheon benchmark platform.

By: Viswanath Sivakumar, Tim Rocktäschel, Alexander H. Miller, Heinrich Küttler, Nantas Nardelli, Mike Rabbat, Joelle Pineau, Sebastian Riedel

October 29, 2019

Taiji: Managing Global User Traffic for Large-Scale Internet Services at the Edge

Symposium on Operating Systems Principles (SOSP)

We present Taiji, a new system for managing user traffic for large-scale Internet services that accomplishes two goals: 1) balancing the utilization of data centers and 2) minimizing network latency of user requests.

By: David Chou, Tianyin Xu, Kaushik Veeraraghavan, Andrew Newell, Sonia Margulis, Lin Xiao, Pol Mauri Ruiz, Justin Meza, Kiryong Ha, Shruti Padmanabha, Kevin Cole, Dmitri Perelman

October 20, 2019

Optimizing and Evaluating Transient Gradual Typing

Dynamic Language Symposium

Gradual typing enables programmers to combine static and dynamic typing in the same language. However, ensuring sound interaction between the static and dynamic parts can incur runtime cost. In this paper, we analyze the performance of the transient design for gradual typing in Reticulated Python, a gradually typed variant of Python.

By: Michael M. Vitousek, Jeremy G. Siek, Avik Chaudhuri