Training Hybrid Language Models by Marginalizing over Segmentations

Association for Computational Linguistics (ACL)


Abstract

In this paper, we study the problem of hybrid language modeling, that is, using models which can predict both characters and larger units such as character n-grams or words. With such models, a given string usually admits multiple segmentations, for example one into words and one into characters only, and the probability of the string is the sum of the probabilities of all its possible segmentations. Here, we show how to marginalize over the segmentations efficiently, in order to compute the true probability of a sequence. We apply our technique to three datasets, comprising seven languages, and show improvements over a strong character-level language model.
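The marginalization the abstract describes can be computed exactly with a standard forward dynamic program: if alpha[t] is the log-probability of the prefix x[:t] summed over all of its segmentations, then alpha[t] is the logsumexp, over the length l of the final segment, of alpha[t-l] plus the model's log-probability of that segment given the preceding text. The sketch below illustrates this recursion only; it assumes a hypothetical callback segment_logprob(prefix, segment) and a maximum segment length max_seg_len, neither of which is part of the paper's released interface.

import math

def log_marginal(x, segment_logprob, max_seg_len=4):
    """Compute log p(x) by summing over all segmentations of x.

    alpha[t] holds the log-probability of the prefix x[:t],
    marginalized over every way of splitting it into segments of
    length 1..max_seg_len.  Runs in O(len(x) * max_seg_len) calls
    to the model.

    segment_logprob(prefix, segment) is an assumed callback that
    returns log p(segment | prefix) under the hybrid model.
    """
    T = len(x)
    alpha = [float("-inf")] * (T + 1)
    alpha[0] = 0.0  # empty prefix has probability 1
    for t in range(1, T + 1):
        # sum over the length l of the last segment ending at t
        terms = []
        for l in range(1, min(max_seg_len, t) + 1):
            j = t - l
            if alpha[j] == float("-inf"):
                continue
            terms.append(alpha[j] + segment_logprob(x[:j], x[j:t]))
        if terms:
            # log-sum-exp, stabilized by the max term
            m = max(terms)
            alpha[t] = m + math.log(sum(math.exp(v - m) for v in terms))
    return alpha[T]

In practice the per-segment log-probabilities would come from a neural language model evaluated in batch, and the same log-space recursion keeps the sum numerically stable; the sketch above is only meant to make the marginalization explicit.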

Related Publications

Libri-light: A benchmark for ASR with limited or no supervision

Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

ICASSP - May 4, 2020

Spatial Attention for Far-Field Speech Recognition with Deep Beamforming Neural Networks

Weipeng He, Lu Lu, Biqiao Zhang, Jay Mahadeokar, Kaustubh Kalgaonkar, Christian Fuegen

ICASSP - May 8, 2020

An Empirical Study of Transformer-Based Neural Language Model Adaptation

Ke Li, Zhe Liu, Tianxiao Shen, Hongzhao Huang, Fuchun Peng, Daniel Povey, Sanjeev Khudanpur

ICASSP - May 9, 2020
