October 27, 2019

NoCaps: Novel object captioning at scale

International Conference on Computer Vision (ICCV)

Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task.

By: Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, Peter Anderson

October 27, 2019

SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition

International Conference on Computer Vision (ICCV)

In this paper we introduce a lightweight “clip-sampling” model that can efficiently identify the most salient temporal clips within a long video. We demonstrate that the computational cost of action recognition on untrimmed videos can be dramatically reduced by invoking recognition only on these most salient clips. Furthermore, we show that this yields significant gains in recognition accuracy compared to analysis of all clips or randomly/uniformly selected clips.
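The selection step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's model: the saliency scorer here is a stand-in callable (the actual SCSampler is a learned lightweight network), and all names and shapes are assumptions for the demo.

```python
import numpy as np

def sample_salient_clips(clip_features, saliency_scorer, k=10):
    """Score every clip with a cheap saliency model and keep the top-k.

    clip_features: (num_clips, feat_dim) array of lightweight per-clip features.
    saliency_scorer: callable mapping features -> per-clip saliency scores.
    Returns the indices of the k most salient clips, in temporal order.
    """
    scores = saliency_scorer(clip_features)   # one cheap pass over all clips
    top_k = np.argsort(scores)[::-1][:k]      # indices of highest-scoring clips
    return np.sort(top_k)                     # restore temporal order

# Toy demo: use feature norm as a stand-in for the learned saliency score.
rng = np.random.default_rng(0)
feats = rng.normal(size=(120, 16))            # 120 clips from one long video
selected = sample_salient_clips(feats, lambda f: np.linalg.norm(f, axis=1), k=10)
# Only these 10 clips would be passed to the expensive action recognizer.
```

The expensive recognition model then runs on `selected` clips only, which is where the claimed computational savings on untrimmed videos come from.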

By: Bruno Korbar, Du Tran, Lorenzo Torresani

October 27, 2019

Live Face De-Identification in Video

International Conference on Computer Vision (ICCV)

We propose a method for face de-identification that enables fully automatic video modification at high frame rates. The goal is to maximally decorrelate the identity while keeping the perception (pose, illumination, and expression) fixed.

By: Oran Gafni, Lior Wolf, Yaniv Taigman

October 27, 2019

Scaling and Benchmarking Self-Supervised Visual Representation Learning

International Conference on Computer Vision (ICCV)

Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning – the ability to scale to large amounts of data, since self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches to 100 million images.

By: Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra

October 27, 2019

IMP: Instance Mask Projection for High Accuracy Semantic Segmentation of Things

International Conference on Computer Vision (ICCV)

In this work, we present a new operator, called Instance Mask Projection (IMP), which projects a predicted instance segmentation as a new feature for semantic segmentation. It also supports backpropagation, so it is trainable end-to-end. By adding this operator, we introduce a new paradigm which combines top-down and bottom-up information in semantic segmentation.
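The core idea of projecting instance predictions into a semantic feature map can be sketched as below. This is a hedged approximation, not the paper's exact operator: how IMP renders soft masks and merges overlapping instances is an assumption here (per-pixel max per class), and all shapes and names are illustrative.

```python
import numpy as np

def instance_mask_projection(masks, classes, num_classes, hw):
    """Project predicted instance masks into a per-class semantic feature map.

    masks:   (num_instances, H, W) soft instance masks in [0, 1].
    classes: (num_instances,) predicted class id per instance.
    Returns a (num_classes, H, W) map usable as an extra feature for the
    semantic-segmentation head.
    """
    H, W = hw
    proj = np.zeros((num_classes, H, W), dtype=np.float32)
    for mask, c in zip(masks, classes):
        # max keeps the strongest instance evidence per pixel and class
        proj[c] = np.maximum(proj[c], mask)
    return proj

# Two toy instances (classes 1 and 2) on an 8x8 grid.
masks = np.zeros((2, 8, 8), dtype=np.float32)
masks[0, :4, :4] = 0.9   # instance of class 1
masks[1, 4:, 4:] = 0.7   # instance of class 2
feat = instance_mask_projection(masks, np.array([1, 2]), num_classes=3, hw=(8, 8))
```

Because the projection is built from elementwise max over differentiable mask scores, gradients flow back to the instance branch, which is what makes the combined model trainable end-to-end.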

By: Cheng-Yang Fu, Tamara Berg, Alexander C. Berg

October 27, 2019

Habitat: A Platform for Embodied AI Research

International Conference on Computer Vision (ICCV)

We present Habitat, a platform for research in embodied artificial intelligence (AI). Habitat enables training embodied agents (virtual robots) in highly efficient photorealistic 3D simulation.

By: Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, Dhruv Batra

October 26, 2019

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

International Conference on Computer Vision (ICCV)

In natural images, information is conveyed at different frequencies: higher frequencies usually encode fine details, while lower frequencies usually encode global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost.
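The storage idea can be sketched as follows. This is only the frequency split and low-resolution storage; the full OctConv also exchanges information between the two paths with convolutions, which is omitted here. The channel split ratio `alpha` and the pooling/upsampling choices are illustrative assumptions.

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling: the low-frequency path lives at half resolution."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbour upsampling back to the high-frequency resolution."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def octave_split(feat, alpha=0.5):
    """Split channels into low- and high-frequency groups (ratio alpha);
    the low-frequency group is stored at half the spatial resolution."""
    c = feat.shape[0]
    c_low = int(alpha * c)
    high = feat[c_low:]
    low = avg_pool2(feat[:c_low])
    return high, low

feat = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
high, low = octave_split(feat, alpha=0.5)
# low occupies a quarter of the spatial footprint of its original channels;
# merging recovers a full-resolution tensor for downstream layers.
merged = np.concatenate([high, upsample2(low)], axis=0)
```

Storing half the channels at a quarter of the spatial size is where the memory and computation savings come from.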

By: Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

October 26, 2019

Co-Separating Sounds of Visual Objects

International Conference on Computer Vision (ICCV)

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of “true” mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos.

By: Ruohan Gao, Kristen Grauman

October 26, 2019

Grounded Human-Object Interaction Hotspots From Video

International Conference on Computer Vision (ICCV)

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction “hotspots” directly from video.

By: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

October 26, 2019

On Network Design Spaces for Visual Recognition

International Conference on Computer Vision (ICCV)

Over the past several years progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing network architectures. In particular, we introduce a new comparison paradigm of distribution estimates, in which network design spaces are compared by applying statistical techniques to populations of sampled models, while controlling for confounding factors like network complexity.
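The distribution-estimate comparison can be sketched as below. Everything concrete here is a toy assumption: the two "design spaces", the evaluation function, and the sample size are invented for illustration, standing in for actually training populations of sampled models.

```python
import random

def sample_errors(design_space, evaluate, n=100, seed=0):
    """Sample n configurations from a design space and collect their errors."""
    rng = random.Random(seed)
    return sorted(evaluate(design_space(rng)) for _ in range(n))

def edf(errors, threshold):
    """Empirical distribution function over a population of sampled models:
    the fraction whose error falls below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)

# Hypothetical toy spaces: wider networks get (noisily) lower error.
def space_a(rng):
    return {"width": rng.choice([16, 32, 64])}

def space_b(rng):
    return {"width": rng.choice([32, 64, 128])}

def evaluate(cfg):
    # stand-in for training and evaluating one sampled model
    noise = random.Random(cfg["width"]).random()
    return 40.0 - 5.0 * cfg["width"].bit_length() + noise

errors_a = sample_errors(space_a, evaluate)
errors_b = sample_errors(space_b, evaluate)
# Comparing edf(errors_a, t) against edf(errors_b, t) across thresholds t
# compares the design spaces as distributions rather than via single
# best-found models, which is the paradigm the abstract describes.
```

Controlling for confounders like network complexity would amount to sampling or reweighting the populations so that, e.g., parameter counts match across the two spaces before comparing their error distributions.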

By: Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, Piotr Dollár