Scaling up online speech recognition using ConvNets

arXiv


Abstract

We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). The system achieves almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate. We improve the core TDS architecture to limit the future context and hence reduce latency while maintaining accuracy. Equally important to the recognizer's efficiency is our highly optimized beam-search decoder. To show the impact of our design choices, we analyze throughput, latency, and accuracy, and discuss how these metrics can be tuned based on user requirements.
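The key latency idea in the abstract — limiting future context — can be illustrated with a plain 1-D convolution. The sketch below is illustrative only, not the paper's actual TDS implementation: with a kernel of width k, symmetric padding lets the output at frame t depend on roughly (k − 1)/2 future frames, while shifting the padding to the left caps that lookahead (at zero, the layer is fully causal and adds no algorithmic latency). The function name and parameters are hypothetical.

```python
def conv1d_limited_lookahead(signal, kernel, future_context):
    """Convolve `signal` with `kernel`, allowing the output at frame t to
    see at most `future_context` frames beyond t. The rest of the padding
    is placed on the left (past), trading symmetry for lower latency.
    Illustrative sketch only; not the paper's TDS layer."""
    k = len(kernel)
    left_pad = k - 1 - future_context  # remaining padding goes to the past
    padded = [0.0] * left_pad + list(signal) + [0.0] * future_context
    out = []
    for t in range(len(signal)):
        # Window covering frames [t - left_pad, t + future_context]
        out.append(sum(padded[t + i] * kernel[i] for i in range(k)))
    return out

# With future_context=0 the output at time t depends only on frames <= t,
# so the layer can run on a live audio stream without waiting for input.
streamable = conv1d_limited_lookahead([1.0, 2.0, 3.0, 4.0],
                                      [0.25, 0.25, 0.5],
                                      future_context=0)
```

In a streaming recognizer, stacking many such layers makes total lookahead the sum of each layer's `future_context`, which is why the paper constrains future context per layer rather than only at the output.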

Related Publications

Libri-light: A benchmark for ASR with limited or no supervision

Jacob Kahn, Morgan Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, Emmanuel Dupoux

ICASSP - May 4, 2020

Spatial Attention for Far-Field Speech Recognition with Deep Beamforming Neural Networks

Weipeng He, Lu Lu, Biqiao Zhang, Jay Mahadeokar, Kaustubh Kalgaonkar, Christian Fuegen

ICASSP - May 8, 2020

An Empirical Study of Transformer-Based Neural Language Model Adaptation

Ke Li, Zhe Liu, Tianxiao Shen, Hongzhao Huang, Fuchun Peng, Daniel Povey, Sanjeev Khudanpur

ICASSP - May 9, 2020
