
December 8, 2019

RUBi: Reducing Unimodal Biases for Visual Question Answering

Neural Information Processing Systems (NeurIPS)

Visual Question Answering (VQA) is the task of answering questions about an image. VQA models often exploit unimodal biases to provide the correct answer without using the image information. As a result, they suffer from a huge drop in performance when evaluated on data outside their training set distribution. This critical issue makes them unsuitable for real-world settings. We propose RUBi, a new learning strategy to reduce biases in any VQA model. It reduces the importance of the most biased examples, i.e., examples that can be correctly classified without looking at the image.
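The general strategy of down-weighting biased examples can be sketched as a loss in which a question-only branch masks the main model's predictions. This is a hedged illustration of the idea, not the paper's exact formulation; all tensor and function names here are illustrative:

```python
import torch
import torch.nn.functional as F

def rubi_style_loss(main_logits, question_only_logits, targets):
    """Sketch of a bias-reduction loss in the spirit of RUBi.

    A question-only branch captures the unimodal bias; its confidences
    mask the main model's logits, so examples that are answerable from
    the question alone contribute a smaller gradient to the main model.
    """
    mask = torch.sigmoid(question_only_logits)
    fused_logits = main_logits * mask
    loss_main = F.cross_entropy(fused_logits, targets)
    # The question-only branch gets its own loss so it learns the bias.
    loss_question = F.cross_entropy(question_only_logits, targets)
    return loss_main + loss_question
```

At test time only the main model would be used; the question-only branch exists solely to shape the training signal.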

By: Remi Cadene, Corentin Dancette, Hedi Ben-younes, Matthieu Cord, Devi Parikh

December 8, 2019

Cross-channel Communication Networks

Neural Information Processing Systems (NeurIPS)

Convolutional neural networks process input data by sending channel-wise feature response maps to subsequent layers. While much progress has been made by making networks deeper, information from each channel can only be propagated from lower levels to higher levels in a hierarchical feed-forward manner. Viewing each filter in a convolutional layer as a neuron, these neurons do not communicate explicitly with one another within the same layer. We introduce a novel network unit called the Cross-channel Communication (C3) block, a simple yet effective module that encourages communication among neurons within the same layer.
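One hypothetical way to realize intra-layer channel communication is to encode each channel's feature map to a vector, let channels exchange messages via pairwise attention, and decode the messages back into per-channel residual updates. This is a minimal sketch under those assumptions, not the paper's actual C3 architecture; the class and parameter names are made up:

```python
import torch
import torch.nn as nn

class ChannelTalkBlock(nn.Module):
    """Hypothetical sketch of within-layer channel communication.

    Assumes a fixed spatial size so each channel map can be flattened
    and passed through a shared linear encoder/decoder.
    """
    def __init__(self, spatial, hidden=16):
        super().__init__()
        self.encode = nn.Linear(spatial, hidden)
        self.decode = nn.Linear(hidden, spatial)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = x.view(b, c, h * w)
        z = self.encode(flat)                   # (B, C, hidden)
        # Pairwise attention over channels: each channel attends to all
        # channels in the same layer and aggregates their messages.
        scores = z @ z.transpose(1, 2) / z.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)    # (B, C, C)
        messages = attn @ z                     # (B, C, hidden)
        update = self.decode(messages).view(b, c, h, w)
        return x + update                       # residual update keeps shape
```

Because the block is shape-preserving and residual, it could in principle be dropped between existing convolutional layers.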

By: Jianwei Yang, Zhile Ren, Hongyuan Zhu, Ji Lin, Chuang Gan, Devi Parikh

December 7, 2019

Unsupervised Object Segmentation by Redrawing

Neural Information Processing Systems (NeurIPS)

Object segmentation is a crucial problem that is usually solved by using supervised learning approaches over very large datasets composed of both images and corresponding object masks. Since the masks have to be provided at pixel level, building such a dataset for any new domain can be very time-consuming. We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way.

By: Mickaël Chen, Thierry Artières, Ludovic Denoyer

November 25, 2019

Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller

Neural Information Processing Systems (NeurIPS)

We study a generalized setup for learning from demonstration to build an agent that can manipulate novel objects in unseen scenarios by looking at only a single video of human demonstration from a third-person perspective. To accomplish this goal, our agent should not only learn to understand the intent of the demonstrated third-person video in its context but also perform the intended task in its environment configuration.

By: Pratyusha Sharma, Deepak Pathak, Abhinav Gupta

November 17, 2019

Correlated Uncertainty for Learning Dense Correspondences from Noisy Labels

Neural Information Processing Systems (NeurIPS)

Many machine learning methods depend on human supervision to achieve optimal performance. However, in tasks such as DensePose, where the goal is to establish dense visual correspondences between images, the quality of manual annotations is intrinsically limited. We address this issue by augmenting neural network predictors with the ability to output a distribution over labels, thus explicitly and introspectively capturing the aleatoric uncertainty in the annotations.
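Predicting a distribution over labels rather than a point estimate is commonly done by having the network output a variance alongside each prediction and training with a Gaussian negative log-likelihood (the Kendall-and-Gal-style aleatoric loss). This is a generic sketch of that idea, not the paper's correlated-uncertainty model:

```python
import torch

def heteroscedastic_nll(pred_mean, pred_log_var, target):
    """Gaussian NLL with a predicted per-element variance.

    Elements the network deems noisy (high predicted variance) have
    their squared error down-weighted, while the log-variance term
    penalizes claiming high uncertainty everywhere.
    """
    inv_var = torch.exp(-pred_log_var)
    sq_err = (pred_mean - target) ** 2
    return (0.5 * inv_var * sq_err + 0.5 * pred_log_var).mean()
```

Under this loss, unreliable annotations are automatically discounted instead of being fit exactly, which is the behavior one wants when label quality is intrinsically limited.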

By: Natalia Neverova, David Novotny, Andrea Vedaldi

November 10, 2019

An Integrated 6DoF Video Camera and System Design

SIGGRAPH Asia

Designing a fully integrated 360° video camera supporting 6DoF head motion parallax requires overcoming many technical hurdles, including camera placement, optical design, sensor resolution, system calibration, real-time video capture, depth reconstruction, and real-time novel view synthesis. While there is a large body of work describing various system components, such as multi-view depth estimation, our paper is the first to describe a complete, reproducible system that considers the challenges arising when designing, building, and deploying a full end-to-end 6DoF video camera and playback environment.

By: Albert Parra Pozo, Michael Toksvig, Terry Filiba Schrager, Joyce Hsu, Uday Mathur, Alexander Sorkine-Hornung, Richard Szeliski, Brian Cabral

November 3, 2019

Improving Generative Visual Dialog by Answering Diverse Questions

Conference on Empirical Methods in Natural Language Processing (EMNLP)

Prior work on training generative Visual Dialog models with reinforcement learning (Das et al., 2017b) has explored a Q-BOT-A-BOT image-guessing game and shown that this ‘self-talk’ approach can lead to improved performance at the downstream dialog-conditioned image-guessing task. However, this improvement saturates and starts degrading after a few rounds of interaction, and does not lead to a better Visual Dialog model.

By: Vishvak Murahari, Prithvijit Chattopadhyay, Dhruv Batra, Devi Parikh, Abhishek Das

October 31, 2019

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

International Conference on Computer Vision (ICCV)

Many vision and language models suffer from poor visual grounding – often falling back on easy-to-learn language priors rather than basing their decisions on visual concepts in the image. In this work, we propose a generic approach called Human Importance-aware Network Tuning (HINT) that effectively leverages human demonstrations to improve visual grounding.

By: Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, Devi Parikh

October 29, 2019

Talking With Hands 16.2M: A Large-Scale Dataset of Synchronized Body-Finger Motion and Audio for Conversational Motion Analysis and Synthesis

International Conference on Computer Vision (ICCV)

We present a 16.2 million frame (50 hour) multimodal dataset of two-person face-to-face spontaneous conversations. Our dataset features synchronized body and finger motion as well as audio data. To the best of our knowledge, it represents the largest motion capture and audio dataset of natural conversations to date.

By: Gilwoo Lee, Zhiwei Deng, Shugao Ma, Takaaki Shiratori, Siddhartha S. Srinivasa, Yaser Sheikh

October 28, 2019

SlowFast Networks for Video Recognition

International Conference on Computer Vision (ICCV)

We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution.
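The two-pathway design can be illustrated with a toy stub: the Slow pathway sees every alpha-th frame with many channels, while the Fast pathway sees all frames with a fraction of the channels. This is a minimal sketch of the idea only; the real SlowFast architecture uses deep 3D ResNet pathways with lateral connections, and all names and sizes below are illustrative:

```python
import torch
import torch.nn as nn

class SlowFastStub(nn.Module):
    """Toy two-pathway stub in the spirit of SlowFast.

    Slow pathway: temporally subsampled input (stride alpha), wide channels.
    Fast pathway: full frame rate, channels reduced by a factor of beta.
    """
    def __init__(self, in_ch=3, slow_ch=64, alpha=8, beta=8):
        super().__init__()
        self.alpha = alpha
        self.slow = nn.Conv3d(in_ch, slow_ch, kernel_size=1)
        self.fast = nn.Conv3d(in_ch, slow_ch // beta, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, clip):                    # clip: (B, C, T, H, W)
        slow_in = clip[:, :, ::self.alpha]      # low frame rate for semantics
        s = self.pool(self.slow(slow_in)).flatten(1)
        f = self.pool(self.fast(clip)).flatten(1)
        return torch.cat([s, f], dim=1)         # fused clip representation
```

The asymmetry is the point: temporal detail is cheap in the lightweight Fast pathway, while spatial capacity is concentrated in the Slow pathway that processes few frames.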

By: Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, Kaiming He