November 4, 2013

An Analysis of Facebook Photo Caching

ACM Symposium on Operating Systems Principles (SOSP)

This paper examines the workload of Facebook’s photo-serving stack and the effectiveness of the many layers of caching it employs. Facebook’s image-management infrastructure is complex and geographically distributed. It includes browser caches on end-user systems, Edge Caches at ~20 PoPs, an Origin Cache, and for some kinds of images, additional caching via Akamai. The underlying image storage layer is widely distributed, and includes multiple data centers.

Qi Huang, Ken Birman, Robbert van Renesse, Wyatt Lloyd, Sanjeev Kumar, Harry Li
October 1, 2013

Virtual Network Diagnosis as a Service

ACM Symposium on Cloud Computing (SoCC)

Today’s cloud network platforms allow tenants to construct sophisticated virtual network topologies among their VMs on a shared physical network infrastructure. However, these platforms provide little…

Wenfei Wu, Guohui Wang, Aditya Akella, Anees Shaikh
August 27, 2013

Scuba: Diving into Data at Facebook

International Conference on Very Large Data Bases (VLDB)

Facebook takes performance monitoring seriously. Performance issues can impact over one billion users so we track thousands of servers, hundreds of PB of daily network traffic, hundreds of daily code…

Lior Abraham, John Allen, Oleksandr Barykin, Vinayak Borkar, Bhuwan Chopra, Ciprian Gerea, Dan Merl, Josh Metzler, David Reiss, Subbu Subramanian, Janet Wiener, Okay Zed
August 26, 2013

Unicorn: A System for Searching the Social Graph

International Conference on Very Large Data Bases (VLDB)

Unicorn is an online, in-memory social graph-aware indexing system designed to search trillions of edges between tens of billions of users and entities on thousands of commodity servers. Unicorn is based on standard concepts in information retrieval, but it includes features to promote results with good social proximity. It also supports queries that require multiple round-trips to leaves in order to retrieve objects that are more than one edge away from source nodes.

Mike Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Adrian Grijincu, Tom Jackson, Soren Lassen, Philip Pronin, Guanghao Shen, Gintaras Woss, Chao Yang, Ning Zhang, Sriram Sankar
August 26, 2013

XORing Elephants: Novel Erasure Codes for Big Data

International Conference on Very Large Data Bases (VLDB)

Distributed storage systems for large clusters typically use replication to provide reliability. Recently, erasure codes have been used to reduce the large storage overhead of three-replicated systems. Reed-Solomon codes are the standard design choice and their high repair cost is often considered an unavoidable price to pay for high storage efficiency and high reliability.

Maheshwaran Sathiamoorthy, Megasthenis Asteris, Dimitris Papailiopoulos, Alexandros G. Dimakis, Ramkumar Vadali, Scott Chen, Dhruba Borthakur
August 22, 2013

Reciprocal Hash Tables for Nearest Neighbor Search

AAAI Conference on Artificial Intelligence (AI)

Recent years have witnessed the success of hashing techniques in approximate nearest neighbor search. In practice, multiple hash tables are usually employed to retrieve more desired results from all hit buckets of each table. However, there are rare works studying the unified approach to constructing multiple informative hash tables except the widely used random way.

Xianglong Liu, Junfeng He, Bo Lang
August 22, 2013

Weighted Hashing for Fast Large Scale Similarity Search

ACM International Conference on Information and Knowledge Management (CIKM)

Similarity search, or finding approximate nearest neighbors, is an important technique for many applications. Many recent research demonstrate that hashing methods can achieve promising results for large scale similarity search due to its computational and memory efficiency.

Qifan Wang, Dan Zhang, Luo Si
August 12, 2013

Graph Cluster Randomization: Network Exposure to Multiple Universes

ACM Conference on Knowledge Discovery and Data Mining (KDD)

A drawback with A/B testing is that it is poorly suited for experiments involving social interference, when the treatment of individuals spills over to neighboring individuals along an underlying social network. In this work, we propose a novel methodology using graph clustering to analyze average treatment effects under social interference.

Johan Ugander, Brian Karrer, Lars Backstrom, Jon Kleinberg
August 11, 2013

MI2LS: Multi-Instance Learning from Multiple Information Sources

ACM Conference on Knowledge Discovery and Data Mining (KDD)

In Multiple Instance Learning (MIL), each entity is normally expressed as a set of instances. Most of the current MIL methods only deal with the case when each instance is represented by one type of f…

Dan Zhang, Jingrui He, Richard Lawrence
August 9, 2013

Uncertainty in Online Experiments with Dependent Data: An Evaluation of Bootstrap Methods

ACM Conference on Knowledge Discovery and Data Mining (KDD)

Many online experiments exhibit dependence between users and items. For example, in online advertising, observations that have a user or an ad in common are likely to be associated. Because of this, even in experiments involving millions of subjects, the difference in mean outcomes between control and treatment conditions can have substantial variance. Previous theoretical and simulation results demonstrate that not accounting for this kind of dependence structure can result in confidence intervals that are too narrow, leading to inaccurate hypothesis tests.

Eytan Bakshy, Dean Eckles
July 23, 2013

Semantic Hashing using Tags and Topic Modeling

ACM Special Interest Group on Information Retrieval Conference (SIGIR)

It is an important research problem to design efficient and effective solutions for large scale similarity search. One popular strategy is to represent data examples as compact binary codes through semantic hashing, which has produced promising results with fast search speed and low storage cost. Many existing semantic hashing methods generate binary codes for documents by modeling document relationships based on similarity in a keyword feature space. Two major limitations in existing methods are: (1) Tag information is often associated with documents in many real world applications, but has not been fully exploited yet; (2) The similarity in keyword feature space does not fully reflect semantic relationships that go beyond keyword matching.

Qifan Wang, Dan Zhang, Luo Si
July 16, 2013

Selection Effects in Online Sharing: Consequences for Peer Adoption

ACM Conference on Electronic Commerce (EC)

Most models of social contagion take peer exposure to be a corollary of adoption, yet in many settings, the visibility of one’s adoption behavior happens through a separate decision process. In online systems, product designers can define how peer exposure mechanisms work: adoption behaviors can be shared in a passive, automatic fashion, or occur through explicit, active sharing.

Sean J. Taylor, Eytan Bakshy, Sinan Aral
July 14, 2013

TAO: Facebook’s Distributed Data Store for the Social Graph

USENIX Annual Technical Conference 2013

We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model.

Peter Dimov Hui Ding, George Cabrera, Zach Amsden, Venkat Venkataramani, Lovro Puzar, Harry Li, Anthony Giardullo, Jack Ferris, Sachin Kulkarni, Prasad Chakka, Yee Jiun Song, Nathan Bronson, Mark Marchukov Dmitri Petrov
July 9, 2013

Calling All Facebook Friends: Exploring requests for help on Facebook

AAAI Conference on Weblogs and Social Media (ICWSM)

Past research suggests Facebook use is linked to perceptions of social capital, a concept that taps into the resources people gain from interactions with their social network. In this study, we examin…

Nicole Ellison, Rebecca Gray, Jessica Vitak, Cliff Lampe, Andrew Tresolini Fiore
July 8, 2013

The Anatomy of Large Facebook Cascades

AAAI Conference on Weblogs and Social Media (ICWSM)

When users post photos on Facebook, they have the option of allowing their friends, followers, or anyone at all to subsequently reshare the photo. A portion of the billions of photos posted to Facebook generates cascades of reshares, enabling many additional users to see, like, comment, and reshare the photos.

Alex Dow, Lada Adamic, Adrien Friggeri
July 8, 2013

Families on Facebook

AAAI Conference on Weblogs and Social Media (ICWSM)

This descriptive study of millions of US Facebook users documents “friending” and communication patterns, exploring parent-child relationships across a variety of life stages and gender combinations.

Moira Burke, Lada Adamic, Karyn Marciniak
July 2, 2013

Self-censorship on Facebook

AAAI Conference on Weblogs and Social Media (ICWSM)

We report results from an exploratory analysis examining “last-minute” self-censorship, or content that is filtered after being written, on Facebook. We collected data from 3.9 mil-lion users over 17 days and associate self-censorship behavior with features describing users, their social graph, and the interactions between them.AAAI Conference on Weblogs and Social Media (ICWSM)

Sauvik Das, Adam D. I. Kramer
July 1, 2013

Development and Deployment at Facebook

IEEE Internet Computing 17(4)

More than one billion users log in to Facebook at least once a month to connect and share content with each other. Among other activities, these users upload over 2.5 billion content items every day.

Dror Feitelson, Eitan Frachtenberg, Kent Beck
June 26, 2013

TAO: Facebook’s Distributed Data Store for the Social Graph

USENIX Annual Technical Conference (ATC)

We introduce a simple data model and API tailored for serving the social graph, and TAO, an implementation of this model. TAO is a geographically distributed data store that provides efficient and timely access to the social graph for Facebook’s demanding workload using a fixed set of queries. It is deployed at Facebook, replacing memcache for many data types that fit its model.

Nathan Bronson, Zachary Amsden, George Cabrera III, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, Venkat Venkataramani
June 26, 2013

A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster

USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage

Erasure codes, such as Reed-Solomon (RS) codes, are being increasingly employed in data centers to combat the cost of reliably storing large amounts of data. Although these codes provide optimal stora…

K.V. Rashmi, Nihar B. Shah, Dikang Gu, Hairong Kuang, Dhruba Borthakur, Kannan Ramchandran