Semantic Hashing using Tags and Topic Modeling

ACM Special Interest Group on Information Retrieval Conference (SIGIR)


It is an important research problem to design efficient and effective solutions for large scale similarity search. One popular strategy is to represent data examples as compact binary codes through semantic hashing, which has produced promising results with fast search speed and low storage cost. Many existing semantic hashing methods generate binary codes for documents by modeling document relationships based on similarity in a keyword feature space. Two major limitations in existing methods are: (1) Tag information is often associated with documents in many real world applications, but has not been fully exploited yet; (2) The similarity in keyword feature space does not fully reflect semantic relationships that go beyond keyword matching.

This paper proposes a novel hashing approach, Semantic Hashing using Tags and Topic Modeling (SHTTM), to incorporate both the tag information and the similarity information from probabilistic topic modeling. In particular, a unified framework is designed for ensuring hashing codes to be consistent with tag information by a formal latent factor model and preserving the document topic/semantic similarity that goes beyond keyword matching. An iterative coordinate descent procedure is proposed for learning the optimal hashing codes. An extensive set of empirical studies on four different datasets has been conducted to demonstrate the advantages of the proposed SHTTM approach against several other state-of-the-art semantic hashing techniques. Furthermore, experimental results indicate that the modeling of tag information and utilizing topic modeling are beneficial for improving the effectiveness of hashing separately, while the combination of these two techniques in the unified framework obtains even better results.

Related Publications

All Publications

POPL - January 16, 2022

Concurrent Incorrectness Separation Logic

Azalea Raad, Josh Berdine, Derek Dreyer, Peter O'Hearn

HOTI - November 1, 2021

Scalable Distributed Training of Recommendation Models: An ASTRA-SIM + NS3 case-study with TCP/IP transport

Saeed Rashidi, Pallavi Shurpali, Srinivas Sridharan, Naader Hassani, Dheevatsa Mudigere, Krishnakumar Nair, Misha Smelyanskiy, Tushar Krishna

ICSE - November 17, 2021

Automatic Testing and Improvement of Machine Translation

Zeyu Sun, Jie M. Zhang, Mark Harman, Mike Papadakis, Lu Zhang

ACM OOPSLA - October 22, 2021

VESPA: Static Profiling for Binary Optimization

Angélica Aparecida Moreira, Guilherme Ottoni, Fernando Magno Quintão Pereira

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookie Policy