May 17, 2013

Latent Credibility Analysis

International World Wide Web Conference (WWW)

A frequent problem when dealing with data gathered from multiple sources on the web (ranging from booksellers to Wikipedia pages to stock analyst predictions) is that these sources disagree, and we must decide which of their (often mutually exclusive) claims we should accept. Current state-of-the-art information credibility algorithms known as ‘fact-finders’ are transitive voting systems with rules specifying how votes iteratively flow from sources to claims and then back to sources.
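
To make the idea of votes flowing between sources and claims concrete, here is a minimal sketch of a generic mutual-reinforcement fact-finder, similar in spirit to hubs-and-authorities scoring. It is not the Latent Credibility Analysis method of the paper; the function name `sums_fact_finder`, the update rules, and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict


def sums_fact_finder(assertions, iterations=20):
    """Hypothetical iterative voting scheme (illustrative, not the paper's model).

    assertions: dict mapping each source to the non-empty set of claims it asserts.
    Returns (trust, belief), each normalized so the top score is 1.0.
    """
    trust = {source: 1.0 for source in assertions}
    belief = {}
    for _ in range(iterations):
        # Votes flow from sources to claims: a claim's belief is the
        # summed trust of the sources asserting it.
        raw_belief = defaultdict(float)
        for source, claims in assertions.items():
            for claim in claims:
                raw_belief[claim] += trust[source]
        top = max(raw_belief.values())
        belief = {claim: score / top for claim, score in raw_belief.items()}

        # Votes flow back from claims to sources: a source's trust is the
        # summed belief of the claims it asserts.
        raw_trust = {
            source: sum(belief[claim] for claim in claims)
            for source, claims in assertions.items()
        }
        top = max(raw_trust.values())
        trust = {source: score / top for source, score in raw_trust.items()}
    return trust, belief


# Toy data (assumed): three sites disagree about a book's author.
assertions = {
    "site_a": {"author=Smith", "year=2001"},
    "site_b": {"author=Smith", "year=2001"},
    "site_c": {"author=Jones", "year=2001"},
}
trust, belief = sums_fact_finder(assertions)
```

In this toy run the majority claim `author=Smith` ends with a higher belief than `author=Jones`, and the dissenting `site_c` ends with lower trust than the other two sites.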

Jeff Pasternack, Dan Roth
May 8, 2013

CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks

International World Wide Web Conference (WWW)

In this paper we focus on the social network Facebook and the problem of discerning ill-gotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created.

Alex Beutel, Tom Wanhong Xu, Venkatesan Guruswami, Christopher Palow, Christos Faloutsos
May 1, 2013

Hash Bit Selection: a Unified Solution for Selection Problems in Hashing

Conference on Computer Vision and Pattern Recognition (CVPR)

Hashing-based methods have recently shown promise for large-scale nearest neighbor search. However, good designs involve difficult decisions about many unknowns: data features, hashing algorithms, parameter settings, kernels, etc.

Xianglong Liu, Junfeng He, Bo Lang, Shih-Fu Chang
August 12, 2012

Active Sampling for Entity Matching

ACM Conference on Knowledge Discovery and Data Mining (KDD)

In entity matching, a fundamental issue when training a classifier to label pairs of entities as either duplicates or non-duplicates is selecting informative examples. Although active learning presents an attractive solution to this problem, previous approaches minimize the misclassification rate (0-1 loss) of the classifier, which is an unsuitable metric for entity matching due to class imbalance (i.e., many more non-duplicate pairs than duplicate pairs).
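
A small worked example may help show why 0-1 loss misleads under class imbalance. The sketch below scores a trivial classifier that labels every pair a non-duplicate. The 10-in-1000 duplicate ratio, the helper name `evaluate`, and the use of precision and recall as the contrasting metrics are assumptions for illustration, not details taken from the paper.

```python
def evaluate(y_true, y_pred):
    """Return (error_rate, precision, recall) for boolean duplicate labels."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for truth, pred in pairs if truth and pred)
    false_pos = sum(1 for truth, pred in pairs if not truth and pred)
    false_neg = sum(1 for truth, pred in pairs if truth and not pred)
    errors = sum(1 for truth, pred in pairs if truth != pred)

    error_rate = errors / len(pairs)
    # Define precision and recall as 0.0 when their denominators are zero.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return error_rate, precision, recall


# Assumed toy data: 1000 candidate pairs, only 10 of them true duplicates.
y_true = [True] * 10 + [False] * 990
# A useless classifier that never predicts "duplicate".
all_negative = [False] * 1000

error_rate, precision, recall = evaluate(y_true, all_negative)
```

Here the misclassification rate is only 1%, yet the classifier finds none of the duplicates, so its recall is zero.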

Kedar Bellare, Suresh Iyengar Parthasarathy, Aditya Parameswaran, Vibhor Rastogi
March 1, 2012

Bootstrapping Data Arrays of Arbitrary Order

The Annals of Applied Statistics (AOAS)

In this paper we study a bootstrap strategy for estimating the variance of a mean taken over large multifactor crossed random effects data sets. We apply bootstrap reweighting independently to the lev…

Art B. Owen, Dean Eckles
August 15, 2011

Phonetic Classification Using Controlled Random Walks

Conference of the International Speech Communication Association (Interspeech)

Recently, semi-supervised learning algorithms for phonetic classifiers have been proposed that have obtained promising results. Often, these algorithms attempt to satisfy learning criteria that are not inherent in the standard generative or discriminative training procedures for phonetic classifiers.

Katrin Kirchhoff, Andrei Alexandrescu
July 24, 2011

Learning Relevance from a Heterogeneous Social Network and Its Application in Online Targeting

ACM Special Interest Group on Information Retrieval (SIGIR)

The rise of social networking services in recent years presents new research challenges for matching users with interesting content. While the content-rich nature of these social networks offers many…

Chi Wang, Rajat Raina, David Fong, Ding Zhou, Jiawei Han, Greg Badros
June 20, 2011

YSmart: Yet Another SQL-to-MapReduce Translator

International Conference on Distributed Computing Systems (ICDCS)

MapReduce has become an effective approach to big data analytics in large cluster systems, where SQL-like queries play important roles as the interface between users and systems. However, based on our Facebook daily operation results, certain types of queries are executed at an unacceptably low speed by Hive (a production SQL-to-MapReduce translator). In this paper, we demonstrate that existing SQL-to-MapReduce translators that operate in a one-operation-to-one-job mode and do not consider query correlations cannot generate high-performance MapReduce programs for certain queries, due to the mismatch between complex SQL structures and the simple MapReduce framework. We propose and develop a system called YSmart, a correlation-aware SQL-to-MapReduce translator. YSmart applies a set of rules to use the minimal number of MapReduce jobs to execute multiple correlated operations in a complex query. YSmart can significantly reduce redundant computations, I/O operations and network transfers compared to existing translators. We have implemented YSmart and evaluated it intensively for complex queries on two Amazon EC2 clusters and one Facebook production cluster. The results show that YSmart can outperform Hive and Pig, two widely used SQL-to-MapReduce translators, by more than four times for query execution.

Rubao Lee, Tian Luo, Yin Huai, Fusheng Wang, Yongqiang He, Xiaodong Zhang
January 1, 2011

Supervised Random Walks: Predicting and Recommending Links in Social Networks

ACM International Conference on Web Search and Data Mining (WSDM)

Predicting the occurrence of links is a fundamental problem in networks. In the link prediction problem we are given a snapshot of a network and would like to infer which interactions among existing members are likely to occur in the near future or which existing interactions we are missing. Although this problem has been extensively studied, the challenge of how to effectively combine the information from the network structure with rich node and edge attribute data remains largely open.

Lars Backstrom, Jure Leskovec
June 1, 2010

Not-So-Latent Dirichlet Allocation: Collapsed Gibbs Sampling Using Human Judgments

Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree.

Jonathan Chang
June 1, 2010

Tools for Collecting Speech Corpora via Mechanical Turk

NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk

One of the most difficult tasks in rapidly porting speech applications to new languages is the initial collection of sufficient speech corpora.

Ian Lane, Alex Waibel, Matthias Eck, Kay Rottmann
April 19, 2010

ePluribus: Ethnicity on Social Networks

AAAI Conference on Weblogs and Social Media (ICWSM)

We propose an approach to determine the ethnic breakdown of a population based solely on people’s names and data provided by the U.S. Census Bureau. We demonstrate that our approach is able to predict the ethnicities of individuals as well as the ethnicity of an entire population better than natural alternatives.

Jonathan Chang, Itamar Rosenn, Lars Backstrom, Cameron Marlow