
November 1, 2019

Learning to Speak and Act in a Fantasy Text Adventure Game

Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce a large-scale crowdsourced text adventure game as a research platform for studying grounded dialogue. In it, agents can perceive, emote, and act whilst conducting dialogue with other agents. Models and humans can both act as characters within the game. We describe the results of training state-of-the-art generative and retrieval models in this setting.

By: Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, Jason Weston

October 31, 2019

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

International Conference on Computer Vision (ICCV)

Many vision and language models suffer from poor visual grounding – often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding.

By: Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

October 30, 2019

Data-efficient Co-Adaptation of Morphology and Behaviour with Deep Reinforcement Learning

Conference on Robot Learning (CoRL)

Humans and animals are capable of quickly learning new behaviours to solve new tasks. Yet, we often forget that they also rely on a highly specialized morphology that co-adapted with motor control throughout thousands of years. Although compelling, the idea of co-adapting morphology and behaviours in robots is often infeasible because of long manufacturing times and the need to redesign an appropriate controller for each morphology. In this paper, we propose a novel approach to automatically and efficiently co-adapt a robot morphology and its controller.

By: Kevin Sebastian Luck, Heni Ben Amor, Roberto Calandra

October 29, 2019

Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis

International Conference on Computer Vision (ICCV)

We present a 16.2-million-frame (50-hour) multimodal dataset of two-person face-to-face spontaneous conversations. Our dataset features synchronized body and finger motion as well as audio data. To the best of our knowledge, it represents the largest motion capture and audio dataset of natural conversations to date.

By: Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, Yaser Sheikh

October 28, 2019

Unsupervised Pre-Training of Image Features on Non-Curated Data

International Conference on Computer Vision (ICCV)

Pre-training general-purpose visual features with convolutional neural networks without relying on annotations is a challenging and important task. Most recent efforts in unsupervised feature learning have focused on either small or highly curated datasets like ImageNet, whereas using non-curated raw datasets was found to decrease the feature quality when evaluated on a transfer task. Our goal is to bridge the performance gap between unsupervised methods trained on curated data, which are costly to obtain, and massive raw datasets that are easily available.

By: Mathilde Caron, Piotr Bojanowski, Julien Mairal, Armand Joulin

October 28, 2019

SlowFast Networks for Video Recognition

International Conference on Computer Vision (ICCV)

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.
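The two pathways differ mainly in how densely they sample the input clip in time. A minimal sketch of that sampling scheme (the frame stride tau=16 and speed ratio alpha=8 are assumed illustrative values, not taken from this abstract):

```python
# Illustrative sketch, not the authors' code: temporal sampling for the two
# SlowFast pathways. Assumed hyperparameters: the Slow pathway keeps 1 of
# every tau=16 frames; the Fast pathway samples alpha=8 times more densely.

def sample_pathways(num_frames, tau=16, alpha=8):
    """Return (slow_indices, fast_indices) for a clip of num_frames frames."""
    slow = list(range(0, num_frames, tau))           # sparse: spatial semantics
    fast = list(range(0, num_frames, tau // alpha))  # dense: fine temporal motion
    return slow, fast

slow, fast = sample_pathways(64)
print(len(slow), len(fast))  # 4 frames for Slow, 32 for Fast
```

Each pathway then runs its own 3D network over its frames; the Fast pathway can stay lightweight because it trades channel capacity for temporal resolution.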

By: Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He

October 27, 2019

Order-Aware Generative Modeling Using the 3D-Craft Dataset

International Conference on Computer Vision (ICCV)

We introduce 3D-Craft, a new dataset of 2,500 Minecraft houses each built by human players sequentially from scratch. To learn from these human action sequences, we propose an order-aware 3D generative model called VoxelCNN.

By: Zhuoyuan Chen, Demi Guo, Tong Xiao, Saining Xie, Xinlei Chen, Haonan Yu, Jonathan Gray, Kavya Srinet, Haoqi Fan, Jerry Ma, Charles R. Qi, Shubham Tulsiani, Arthur Szlam, Larry Zitnick

October 27, 2019

Transferability and Hardness of Supervised Classification Tasks

International Conference on Computer Vision (ICCV)

We propose a novel approach for estimating the difficulty and transferability of supervised classification tasks. Unlike previous work, our approach is solution agnostic and does not require or assume trained models. Instead, we estimate these values using an information theoretic approach: treating training labels as random variables and exploring their statistics.
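One way to read the abstract's framing is that, given source-task labels Z and target-task labels Y on the same examples, a low conditional entropy H(Y | Z) suggests the source task is informative about the target. The toy estimator below is our own illustration of that idea from empirical counts; the paper's exact estimator may differ:

```python
import math
from collections import Counter

# Toy sketch (an assumption, not the paper's estimator): treat the two tasks'
# training labels as random variables and estimate the conditional entropy
# H(Y | Z) of target labels Y given source labels Z from empirical counts.
# Lower H(Y | Z) means Z leaves less uncertainty about Y.

def conditional_entropy(z_labels, y_labels):
    n = len(z_labels)
    joint = Counter(zip(z_labels, y_labels))  # empirical joint counts
    marg_z = Counter(z_labels)                # empirical marginal of Z
    h = 0.0
    for (z, y), c in joint.items():
        p_zy = c / n                 # P(Z=z, Y=y)
        p_y_given_z = c / marg_z[z]  # P(Y=y | Z=z)
        h -= p_zy * math.log(p_y_given_z)
    return h

# If Y is fully determined by Z, the conditional entropy is zero:
print(conditional_entropy([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0
```

Note that this requires no trained model, matching the abstract's claim of a solution-agnostic estimate computed from label statistics alone.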

By: Anh T. Tran, Cuong V. Nguyen, Tal Hassner

October 27, 2019

Video Classification with Channel-Separated Convolutional Networks

International Conference on Computer Vision (ICCV)

This paper studies the effects of different design choices in 3D group convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of 3D group convolutional networks.
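Grouping the channels of a convolution reduces both its weight count and the number of input/output channel pairs that directly interact. A rough back-of-envelope sketch of that trade-off (our illustration; the paper's precise definition of channel interactions may differ):

```python
# Illustrative parameter/interaction counting for a 3D convolution with a
# k x k x k kernel, c_in input channels, c_out output channels, g groups.

def conv3d_params(c_in, c_out, k=3, g=1):
    """Weight count of a grouped 3D convolution (bias ignored)."""
    return (c_in // g) * c_out * k ** 3

def channel_interactions(c_in, c_out, g=1):
    """Pairs of (input, output) channels that directly interact."""
    return (c_in // g) * c_out

# 64 -> 64 channels, 3x3x3 kernel:
print(conv3d_params(64, 64, g=1))          # full conv: 110592 weights
print(conv3d_params(64, 64, g=64))         # depthwise conv: 1728 weights
print(channel_interactions(64, 64, g=1))   # 4096 interacting channel pairs
print(channel_interactions(64, 64, g=64))  # 64 interacting channel pairs
```

The depthwise extreme (g equal to the channel count) is far cheaper but removes nearly all cross-channel interaction, which is the axis the paper studies empirically.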

By: Du Tran, Heng Wang, Lorenzo Torresani, Matt Feiszli

October 27, 2019

Improved Conditional VRNNs for Video Prediction

International Conference on Computer Vision (ICCV)

Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting.

By: Lluís Castrejón, Nicolas Ballas, Aaron Courville