Research Area
Year Published

160 Results

January 13, 2020

Scaling up online speech recognition using ConvNets

arXiv

We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC). The system has almost three times the throughput of a well tuned hybrid ASR baseline while also having lower latency and a better word error rate.

By: Vineel Pratap, Qiantong Xu, Jacob Kahn, Gilad Avidov, Tatiana Likhomanenko, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, Ronan Collobert

December 14, 2019

From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition

IEEE Automatic Speech Recognition and Understanding Workshop

There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English which has poor grapheme-phoneme correspondence. In this work, we show for the first time that, on English, hybrid ASR systems can in fact model graphemes effectively by leveraging tied context-dependent graphemes, i.e., chenones.

By: Duc Le, Xiaohui Zhang, Weiyi Zhang, Christian Fuegen, Geoffrey Zweig, Michael L. Seltzer

December 13, 2019

Lead2Gold: Towards exploiting the full potential of noisy transcriptions for speech recognition

Automatic Speech Recognition and Understanding Workshop

The transcriptions used to train an Automatic Speech Recognition (ASR) system may contain errors. Usually, either a quality control stage discards transcriptions with too many errors, or the noisy transcriptions are used as is. We introduce Lead2Gold, a method to train an ASR system that exploits the full potential of noisy transcriptions.

By: Adrien Dufraux, Emmanuel Vincent, Awni Hannun, Armelle Brun, Matthijs Douze

December 12, 2019

Compositional generalization through meta sequence-to-sequence learning

Neural Information Processing Systems (NeurIPS)

People can learn a new concept and use it compositionally, understanding how to “blicket twice” after learning how to “blicket.” In contrast, powerful sequence-to-sequence (seq2seq) neural networks fail such tests of compositionality, especially when composing new concepts together with existing concepts. In this paper, I show how memory-augmented neural networks can be trained to generalize compositionally through meta seq2seq learning.

By: Brenden Lake

December 9, 2019

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS)

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers.

By: Jiasen Lu, Dhruv Batra, Devi Parikh, Stefan Lee

December 8, 2019

Levenshtein Transformer

Neural Information Processing Systems (NeurIPS)

Modern neural sequence generation models are built to either generate tokens step-by-step from scratch or (iteratively) modify a sequence of tokens bounded by a fixed length. In this work, we develop Levenshtein Transformer, a new partially autoregressive model devised for more flexible and amenable sequence generation. Unlike previous approaches, the basic operations of our model are insertion and deletion. The combination of them facilitates not only generation but also sequence refinement allowing dynamic length changes.

By: Jiatao Gu, Changhan Wang, Jake (Junbo) Zhao

December 8, 2019

RUBi: Reducing Unimodal Biases for Visual Question Answering

Neural Information Processing Systems (NeurIPS)

Visual Question Answering (VQA) is the task of answering questions about an image. Some VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.

By: Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh

December 2, 2019

Hierarchical Decision Making by Generating and Following Natural Language Instructions

Neural Information Processing Systems (NeurIPS)

We explore using latent natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a latent plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales.

By: Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis

December 1, 2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Neural Information Processing Systems (NeurIPS)

In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard.

By: Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

November 25, 2019

Findings of the First Shared Task on Machine Translation Robustness

Conference on Machine Learning (WMT)

We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models’ robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations.

By: Xian Li, Paul Michel, Antonios Anastasopoulos, Yonatan Belinkov, Nadir Durrani, Orhan Firat, Philipp Koehn, Graham Neubig, Juan Pino, Hassan Sajjad