October 30, 2017

Crowd Intelligence Enhances Automated Mobile Testing

Automated Software Engineering Conference (ASE)

We show that information extracted from crowd-based testing can enhance automated mobile testing. We introduce POLARIZ, which generates replicable test scripts from crowd-based testing, extracting cross-app ‘motif’ events: automatically-inferred reusable higher-level event sequences composed of lower-level observed event actions. Our empirical study used 434 crowd workers from Mechanical Turk to perform 1,350 testing tasks on 9 popular Google Play apps, each with at least 1 million user installs.

Ke Mao, Mark Harman, Yue Jia
September 20, 2017

Characterizing Large-Scale Production Reliability for 100G Optical Interconnect in Facebook Data Centers

Frontiers in Optics / Laser Science (FiO/LS)

Facebook is deploying cost-effective 100G CWDM4 transceivers in data centers. This paper describes the post-production performance monitoring system that is being implemented to identify early failure modes in optical interconnects.

Abhijit Chakravarty, Srinivasan Giridharan, Matt Kelly, Ashwin Poojary, Vincent Zeng
September 20, 2017

100Gb/s CWDM4 Optical Interconnect at Facebook Data Centers for Bandwidth Enhancement

Frontiers in Optics / Laser Science (FiO/LS)

Facebook has developed 100G data centers from the ground up by fine-tuning optical technologies, optimizing link budget, limiting operating temperatures, and ultimately improving manufacturability. 100G-CWDM4 is an effective technology to enable connectivity over duplex single-mode fiber.

Abhijit Chakravarty, Katharine Schmidtke, Vincent Zeng, Srinivasan Giridharan, Cathie Deal, Reza Niazmand
August 28, 2017

Social Hash Partitioner: A Scalable Distributed Hypergraph Partitioner

Very Large Data Bases Conference (VLDB)

We design and implement a distributed algorithm for balanced k-way hypergraph partitioning that minimizes fanout, a fundamental hypergraph quantity also known as the communication volume and (k − 1)-cut metric, by optimizing a novel objective called probabilistic fanout. This choice allows a simple local search heuristic to achieve comparable solution quality to the best existing hypergraph partitioners.

Igor Kabiljo, Brian Karrer, Mayank Pundir, Sergey Pupyrev, Alon Shalita
August 21, 2017

Engineering Egress with Edge Fabric: Steering Oceans of Content to the World


Large content providers build points of presence around the world, each connected to tens or hundreds of networks. Ideally, this connectivity lets providers better serve users, but providers cannot obtain enough capacity on some preferred peering paths to handle peak traffic demands. These capacity constraints, coupled with volatile traffic and performance and the limitations of the 20-year-old BGP protocol, make it difficult to best use this connectivity. This paper presents Edge Fabric, an SDN-based system we built and deployed to tackle these challenges for Facebook, which serves over two billion users from dozens of points of presence on six continents.

Brandon Schlinker, Hyojeong Kim, Timothy Cui, Ethan Katz-Bassett, Harsha V. Madhyastha, Italo Cunha, James Quinn, Saif Hasan, Petr Lapukhov, James Hongyi Zeng
August 21, 2017

SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs

Association for Computing Machinery's Special Interest Group on Data Communications (SIGCOMM)

In this paper, we show that up to hundreds of software load balancer (SLB) servers can be replaced by a single modern switching ASIC, potentially reducing the cost of load balancing by over two orders of magnitude. Today, large data centers typically employ hundreds or thousands of servers to load-balance incoming traffic over application servers.

Rui Miao, Hongyi Zeng, Changhoon Kim, Jeongkeun Lee, Minlan Yu
April 27, 2017

Passive Realtime Datacenter Fault Detection

USENIX Symposium on Networked Systems Design and Implementation (NSDI) 2017

We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications.

Arjun Roy, James Hongyi Zeng, Jasmeet Bagga, Alex C. Snoeren
April 19, 2017

Joint User-Entity Representation Learning for Event Recommendation in Social Network

2017 IEEE 33rd International Conference on Data Engineering (ICDE)

In this work, we consider the heavy sparseness in both user and event feedback history caused by short lifespans (transiency) of events and user participation patterns in a production event system. We propose to solve the resulting cold-start problems by introducing a joint representation model to project users and events into the same latent space.

Lijun Tang, Eric Yi Liu
April 1, 2017

Spinner: Scalable Graph Partitioning in the Cloud

IEEE International Conference on Data Engineering (ICDE)

In this paper, we present a graph partitioning algorithm to partition graphs with trillions of edges.

Claudio Martella, Dionysios Logothetis, Andreas Loukas, Georgos Siganos
February 4, 2017

Optimizing Function Placement for Large-Scale Data-Center Applications

International Symposium on Code Generation and Optimization (CGO)

We study the impact of function placement in the context of a simple tool we created that uses sample-based profiling data.

Guilherme Ottoni, Bertrand Maher
January 8, 2017

Optimizing Space Amplification in RocksDB

CIDR 2017

RocksDB is an embedded, high-performance, persistent key-value storage engine developed at Facebook.

Siying Dong, Mark Callaghan, Leonidas Galanis, Dhruba Borthakur, Tony Savor, Michael Stumm
November 16, 2016

Performance or Capacity? Different Approaches for Different Tasks

International Conference for Performance and Capacity (CMG imPACt)

Measurement and aggregation approaches that are used in performance monitoring are not always useful for capacity planning, while approaches that we use in capacity planning are often meaningless for performance analysis. This paper explores this gap and discusses ways to reconcile the two tasks.

Alexander Gilgur, Steve Politis
November 13, 2016

Continuous Deployment of Mobile Software at Facebook (Showcase)

ACM SIGSOFT: International Symposium on the Foundations of Software Engineering (FSE 2016)

This paper describes in detail the deployment process for mobile software updates at Facebook.

Chuck Rossi, Elisa Shibley, Shi Su, Kent Beck, Tony Savor, Michael Stumm
November 8, 2016

Performance Or Capacity

CMG imPACt, Conference by the Computer Measurement Group

We explore the gap between the measurement and aggregation approaches used in performance monitoring, which are not always useful for capacity planning, and the approaches used in capacity planning, which are often meaningless for performance analysis, and we discuss ways to reconcile the two tasks.

Alexander Gilgur, Steve Politis
November 2, 2016

DQBarge: Improving Data-Quality Tradeoffs in Large-Scale Internet Services

OSDI 2016

DQBarge is a system that enables better data-quality tradeoffs by propagating critical information along the causal path of request processing.

Jason Flinn, Kaushik Veeraraghavan, Michael Cafarella, Michael Chow
November 2, 2016

Kraken: Leveraging Live Traffic Tests to Identify and Resolve Resource Utilization Bottlenecks in Large Scale Web Services

OSDI 2016

Kraken is a new system that runs load tests by continually shifting live user traffic to one or more data centers.

Kaushik Veeraraghavan, Justin Meza, David Chou, Wonho Kim, Sonia Margulis, Scott Michelson, Rajesh Nishtala, Daniel Obenshain, Dmitri Perelman, Yee Jiun Song
September 25, 2016

Desugaring Haskell’s do-Notation into Applicative Operations

Haskell Symposium 2016

In this paper we show how to re-use the very same do-notation to work for Applicatives as well, providing efficiency benefits for some types that are both Monad and Applicative, and syntactic convenience for those that are merely Applicative.

Simon Marlow, Simon Peyton Jones, Edward Kmett, Andrey Mokhov
September 5, 2016

Cubrick: Indexing Millions of Records per Second for Interactive Analytics

VLDB 2016

This paper describes the architecture and design of Cubrick, a distributed multidimensional in-memory DBMS suited for interactive analytics over highly dynamic datasets.

Pedro Eugenio Rocha Pedreira, Chris Croswhite, Luis Bona
September 1, 2016

Desugaring Haskell’s do-notation Into Applicative Operations

ACM SIGPLAN Haskell Symposium

In this paper we show how to re-use the very same do-notation to work for Applicatives as well, providing efficiency benefits for some types that are both Monad and Applicative, and syntactic convenience for those that are merely Applicative.

Simon Marlow, Simon Peyton Jones, Edward Kmett, Andrey Mokhov
August 23, 2016

Robotron: Top-down Network Management at Facebook Scale


In this paper, we present Robotron, a system for managing a massive production network in a top-down fashion.

Yu-Wei Eric Sung, Xiaozheng Tie, Starsky H.Y. Wong, James Hongyi Zeng