June 1, 2010

Not-so-latent Dirichlet allocation: collapsed Gibbs sampling using human judgments

Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)

By: Jonathan Chang

Abstract

Probabilistic topic models are a popular tool for the unsupervised analysis of text, providing both a predictive model of future text and a latent topic representation of the corpus. Recent studies have found that while there are suggestive connections between topic models and the way humans interpret data, these two often disagree.

In this paper, we explore this disagreement from the perspective of the learning process rather than the output. We present a novel task, tag-and-cluster, which asks subjects to simultaneously annotate documents and cluster those annotations. We then use these annotations to construct a topic model that is grounded in human interpretations of documents.

We demonstrate that these topic models have features that distinguish them from traditional topic models.
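
As a rough illustration of how clustered annotations could be turned into a topic model, the sketch below treats each human annotation as a word-to-topic assignment and computes the smoothed count estimates commonly used with collapsed Gibbs sampling for LDA. This is an assumption for illustration and not necessarily the procedure used in the paper: the function name `topic_model_from_annotations`, the `(doc_id, word, topic)` triple format, and the hyperparameter values `alpha` and `beta` are all hypothetical.

```python
from collections import Counter, defaultdict


def topic_model_from_annotations(annotations, num_topics, vocab,
                                 alpha=0.1, beta=0.01):
    """Estimate topic-word and document-topic distributions from annotations.

    annotations: iterable of (doc_id, word, topic) triples, where topic is an
    integer in range(num_topics) naming the cluster an annotator chose
    (hypothetical input format).
    """
    topic_word = [Counter() for _ in range(num_topics)]
    doc_topic = defaultdict(Counter)

    # Count how often each word and each document is assigned to each topic.
    for doc_id, word, topic in annotations:
        topic_word[topic][word] += 1
        doc_topic[doc_id][topic] += 1

    vocab_size = len(vocab)

    # Smoothed topic-word distributions (one dict per topic).
    phi = []
    for k in range(num_topics):
        total = sum(topic_word[k].values())
        phi.append({w: (topic_word[k][w] + beta) / (total + vocab_size * beta)
                    for w in vocab})

    # Smoothed document-topic distributions (one list per document).
    theta = {}
    for doc_id, counts in doc_topic.items():
        total = sum(counts.values())
        theta[doc_id] = [(counts[k] + alpha) / (total + num_topics * alpha)
                         for k in range(num_topics)]
    return phi, theta


# Toy annotations, invented purely for this example.
annotations = [
    ("d1", "goal", 0), ("d1", "match", 0), ("d1", "vote", 1),
    ("d2", "vote", 1), ("d2", "election", 1),
]
vocab = ["goal", "match", "vote", "election"]
phi, theta = topic_model_from_annotations(annotations, 2, vocab)
```

Here `phi[k]` maps each vocabulary word to its estimated probability under topic `k`, and `theta[doc_id]` lists the estimated topic proportions for a document. The assignments come directly from the annotators rather than from a sampler.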