Research Area
Year Published

164 Results

October 26, 2019

On Network Design Spaces for Visual Recognition

International Conference on Computer Vision (ICCV)

Over the past several years progress in designing better neural network architectures for visual recognition has been substantial. To help sustain this rate of progress, in this work we propose to reexamine the methodology for comparing network architectures. In particular, we introduce a new comparison paradigm of distribution estimates, in which network design spaces are compared by applying statistical techniques to populations of sampled models, while controlling for confounding factors like network complexity.

By: Ilija Radosavovic, Justin Johnson, Saining Xie, Wan-Yen Lo, Piotr Dollar

October 26, 2019

Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution

International Conference on Computer Vision (ICCV)

In natural images, information is conveyed at different frequencies where higher frequencies are usually encoded with fine details and lower frequencies are usually encoded with global structures. Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies. In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially “slower” at a lower spatial resolution reducing both memory and computation cost.

By: Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng

October 26, 2019

Co-Separating Sounds of Visual Objects

International Conference on Computer Vision (ICCV)

Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of “true” mixed sounds. We introduce a co-separation training paradigm that permits learning object-level sounds from unlabeled multi-source videos.

By: Ruohan Gao, Kristen Grauman

October 26, 2019

Grounded Human-Object Interaction Hotspots From Video

International Conference on Computer Vision (ICCV)

Learning how to interact with objects is an important step towards embodied visual intelligence, but existing techniques suffer from heavy supervision or sensing requirements. We propose an approach to learn human-object interaction “hotspots” directly from video.

By: Tushar Nagarajan, Christoph Feichtenhofer, Kristen Grauman

October 25, 2019

Fashion++: Minimal Edits for Outfit Improvement

International Conference on Computer Vision (ICCV)

Given an outfit, what small changes would most improve its fashionability? This question presents an intriguing new vision challenge. We introduce Fashion++, an approach that proposes minimal adjustments to a full-body clothing outfit that will have maximal impact on its fashionability.

By: Wei-Lin Hsiao, Isay Katsman, Chao-Yuan Wu, Devi Parikh, Kristen Grauman

September 5, 2019

C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion

International Conference on Computer Vision (ICCV)

We propose C3DPO, a method for extracting 3D models of deformable objects from 2D keypoint annotations in unconstrained images. We do so by learning a deep network that reconstructs a 3D object from a single view at a time, accounting for partial occlusions, and explicitly factoring the effects of viewpoint changes and object deformations.

By: David Novotny, Nikhila Ravi, Benjamin Graham, Natalia Neverova, Andrea Vedaldi

August 12, 2019

Efficient Segmentation: Learning Downsampling Near Semantic Boundaries

International Conference on Computer Vision (ICCV)

Many automated processes such as auto-piloting rely on a good semantic segmentation as a critical component. To speed up performance, it is common to downsample the input frame. However, this comes at the cost of missed small objects and reduced accuracy at semantic boundaries. To address this problem, we propose a new content-adaptive downsampling technique that learns to favor sampling locations near semantic boundaries of target classes.

By: Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, Yuri Boykov

August 4, 2019

MSURU: Large Scale E-commerce Image Classification With Weakly Supervised Search Data

Conference on Knowledge Discovery and Data Mining (KDD)

In this paper we present a deployed image recognition system used in a large scale commerce search engine, which we call MSURU. It is designed to process product images uploaded daily to Facebook Marketplace. Social commerce is a growing area within Facebook and understanding visual representations of product content is important for search and recommendation applications on Marketplace.

By: Yina Tang, Fedor Borisyuk, Siddarth Malreddy, Yixuan Li, Yiqun Liu, Sergey Kirshner

July 31, 2019

Neural Volumes: Learning Dynamic Renderable Volumes from Images


To overcome memory limitations of voxel-based representations, we learn a dynamic irregular grid structure implemented with a warp field during ray-marching. This structure greatly improves the apparent resolution and reduces grid-like artifacts and jagged motion. Finally, we demonstrate how to incorporate surface-based representations into our volumetric-learning framework for applications where the highest resolution is required, using facial performance capture as a case in point.

By: Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, Yaser Sheikh

July 14, 2019

Extreme Relative Pose Estimation for RGB-D Scans via Scene Completion

Conference on Computer Vision and Pattern Recognition (CVPR)

Estimating the relative rigid pose between two RGB-D scans of the same underlying environment is a fundamental problem in computer vision, robotics, and computer graphics. Most existing approaches allow only limited relative pose changes since they require considerable overlap between the input scans. We introduce a novel approach that extends the scope to extreme relative poses, with little or even no overlap between the input scans.

By: Zhenpei Yang, Jeffrey Z. Pan, Linjie Luo, Xiaowei Zhou, Kristen Grauman, Qixing Huang