December 4, 2017

One-Sided Unsupervised Domain Mapping

Neural Information Processing Systems (NIPS)

In this work, we present a method of learning the mapping G_AB from domain A to domain B without learning the inverse mapping G_BA. This is done by learning a mapping that maintains the distance between a pair of samples.

Sagie Benaim, Lior Wolf
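
The distance-preservation idea in the summary above can be illustrated with a minimal sketch. It assumes an L1 distance and an unnormalized penalty; the function names and the toy mappings are hypothetical and are not taken from the paper, whose actual loss may differ.

```python
def l1_distance(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_preservation_loss(pairs, mapping):
    """Mean absolute change in pairwise distance after applying `mapping`.

    A mapping that keeps every pair exactly as far apart as before scores 0.
    """
    total = 0.0
    for x_i, x_j in pairs:
        d_in = l1_distance(x_i, x_j)
        d_out = l1_distance(mapping(x_i), mapping(x_j))
        total += abs(d_in - d_out)
    return total / len(pairs)


def shift(v):
    """Toy mapping (assumption): translation, which preserves distances."""
    return [x + 1.0 for x in v]


def scale(v):
    """Toy mapping (assumption): scaling by 2, which doubles distances."""
    return [2.0 * x for x in v]


pairs = [([0.0, 0.0], [1.0, 2.0]), ([1.0, 1.0], [4.0, 1.0])]
loss_shift = distance_preservation_loss(pairs, shift)
loss_scale = distance_preservation_loss(pairs, scale)
print(loss_shift, loss_scale)
```

In this toy setting the translation incurs no penalty while the scaling does, which is the behavior a distance-maintaining constraint is meant to reward.
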
October 22, 2017

Mask R-CNN

International Conference on Computer Vision (ICCV)

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.

Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick
October 22, 2017

Transitive Invariance for Self-supervised Visual Representation Learning

International Conference on Computer Vision (ICCV)

In this paper, we propose to exploit different self-supervised approaches to learn representations invariant to (i) inter-instance variations (two objects in the same class should have similar features) and (ii) intra-instance variations (viewpoint, pose, deformations, illumination, etc.).

Xiaolong Wang, Kaiming He, Abhinav Gupta
October 22, 2017

Unsupervised Creation of Parameterized Avatars

International Conference on Computer Vision (ICCV)

We study the problem of mapping an input image to a tied pair consisting of a vector of parameters and an image that is created using a graphical engine from the vector of parameters. The mapping’s objective is to have the output image as similar as possible to the input image.

Lior Wolf, Yaniv Taigman, Adam Polyak
October 22, 2017

Focal Loss for Dense Object Detection

International Conference on Computer Vision (ICCV)

In this paper, we investigate why one-stage detectors that are applied over a regular, dense sampling of possible object locations have the potential to be faster and simpler, but have trailed the accuracy of two-stage detectors thus far. We design and train a simple dense detector we call RetinaNet. Our results show that when trained with the focal loss, RetinaNet is able to match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors.

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollar
October 22, 2017

Low-shot Visual Recognition by Shrinking and Hallucinating Features

International Conference on Computer Vision (ICCV)

Low-shot visual learning, the ability to recognize novel object categories from very few examples, is a hallmark of human visual intelligence. Existing machine learning approaches fail to generalize in the same way. We present a low-shot learning benchmark on complex images that mimics challenges faced by recognition systems in the wild.

Bharath Hariharan, Ross Girshick
October 22, 2017

Dense and Low-Rank Gaussian CRFs using Deep Embeddings

International Conference on Computer Vision (ICCV)

In this work we introduce a structured prediction model that endows the Deep Gaussian Conditional Random Field (G-CRF) with a densely connected graph structure.

Siddhartha Chandra, Nicolas Usunier, Iasonas Kokkinos
October 22, 2017

Inferring and Executing Programs for Visual Reasoning

International Conference on Computer Vision (ICCV)

Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer.

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, Larry Zitnick, Ross Girshick
October 22, 2017

Predicting Deeper into the Future of Semantic Segmentation

International Conference on Computer Vision (ICCV)

The ability to predict and therefore to anticipate the future is an important attribute of intelligence. We introduce the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future.

Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, Yann LeCun
October 22, 2017

Deltille Grids for Geometric Camera Calibration

International Conference on Computer Vision (ICCV)

The recent proliferation of high resolution cameras presents an opportunity to achieve unprecedented levels of precision in visual 3D reconstruction. Yet the camera calibration pipeline, developed decades ago using checkerboards, has remained the de facto standard. In this paper, we ask the question: are checkerboards the optimal pattern for high precision calibration?

Hyowon Ha, Michal Perdoch, Hatem Alismail, In So Kweon, Yaser Sheikh
September 7, 2017

Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog

Conference on Empirical Methods in Natural Language Processing (EMNLP)

In this paper, using a Task & Talk reference game between two agents as a testbed, we present a sequence of ‘negative’ results culminating in a ‘positive’ one – showing that while most agent-invented languages are effective (i.e. achieve near-perfect task rewards), they are decidedly not interpretable or compositional.

Satwik Kottur, José M.F. Moura, Stefan Lee, Dhruv Batra
July 30, 2017

Low-Cost 360 Stereo Photography and Video Capture


In this work, we describe a method that takes images from two 360° spherical cameras and synthesizes an omni-directional stereo panorama with stereo in all directions. Our proposed method has a lower equipment cost than camera-ring alternatives, can be assembled with currently available off-the-shelf equipment, and is relatively small and lightweight compared to the alternatives.

Kevin Matzen, Michael Cohen, Bryce Evans, Johannes Kopf, Richard Szeliski
July 22, 2017

Densely Connected Convolutional Networks

CVPR 2017

In this paper, we embrace the observation that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output, and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.

Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
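
The dense connectivity pattern described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: it assumes that features are combined by concatenation, and the names `dense_block` and `make_toy_layer` are hypothetical stand-ins for real convolutional layers.

```python
def dense_block(x, layers):
    """Apply `layers` so that each one sees the outputs of all earlier ones.

    `x` is a flat list of input features. Every layer receives the
    concatenation of the block input and all previous layer outputs.
    """
    features = [x]
    for layer in layers:
        concatenated = [v for f in features for v in f]
        features.append(layer(concatenated))
    # The block output is the concatenation of everything produced so far.
    return [v for f in features for v in f]


def make_toy_layer(growth_rate):
    """Toy layer (assumption): emits `growth_rate` copies of the input mean."""
    def layer(inputs):
        mean = sum(inputs) / len(inputs)
        return [mean] * growth_rate
    return layer


layers = [make_toy_layer(2) for _ in range(3)]
out = dense_block([1.0, 3.0], layers)
print(len(out))
```

With 2 input features and 3 layers that each add 2 features, the output width grows to 2 + 3 * 2 = 8, showing how each layer's input widens as the block deepens.
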
July 21, 2017

Relationship Proposal Networks

Conference on Computer Vision and Pattern Recognition 2017

In this paper we address the challenges of image scene object recognition by using pairs of related regions in images to train a relationship proposer that at test time produces a manageable number of related regions.

Ahmed Elgammal, Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang
July 21, 2017

Link the head to the “beak”: Zero Shot Learning from Noisy Text Description at Part Precision

CVPR 2017

In this paper, we study learning visual classifiers from unstructured text descriptions at part precision with no training images. We propose a learning framework that is able to connect text terms to their relevant parts and suppress connections to non-visual text terms without any part-text annotations.

Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, Ahmed Elgammal
July 21, 2017

Learning Features by Watching Objects Move

CVPR 2017

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation.

Deepak Pathak, Ross Girshick, Piotr Dollar, Trevor Darrell, Bharath Hariharan
July 21, 2017

Feature Pyramid Networks for Object Detection

CVPR 2017

In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost.

Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
July 21, 2017

Semantic Amodal Segmentation

CVPR 2017

Common visual recognition tasks such as classification, object detection, and semantic segmentation are rapidly reaching maturity, and given the recent rate of progress, it is not unreasonable to conjecture that techniques for many of these problems will approach human levels of performance in the next few years. In this paper we look to the future: what is the next frontier in visual recognition?

Yan Zhu, Yuandong Tian, Dimitris Metaxas, Piotr Dollar
July 21, 2017

Aggregated Residual Transformations for Deep Neural Networks

CVPR 2017

We present a simple, highly modularized network architecture for image classification.

Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, Kaiming He
June 8, 2017

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Data @ Scale

In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization.

Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He