June 18, 2018

Facebook Research at CVPR 2018

By: Facebook Research

Computer vision experts from around the world are gathering in Salt Lake City, Utah this week for CVPR 2018 to present the latest advances in computer vision and pattern recognition. Research from Facebook will be presented in oral spotlight presentations and group poster sessions. Our researchers and engineers will also be organizing and participating in numerous workshops, tutorials, and panels throughout the week, including the fourth annual Women in Computer Vision (WiCV) workshop.

VQA Challenge Winners

We are pleased to announce that a team of engineers and researchers from Facebook AI Research has won this year’s Visual Question Answering (VQA) challenge. The team members are Tina Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra and Devi Parikh (with the first three jointly leading the effort). The team calls itself FAIR A-STAR (Agents that See, Talk, Act and Reason). Congratulations, FAIR A-STAR!

Join us LIVE from CVPR 2018

For those who can’t make it to CVPR 2018, we’ll be hosting a Facebook LIVE and streaming all CVPR oral spotlight presentations and the WiCV workshop from the CVPR Facebook page, starting Tuesday at 8:30 am MDT. Be sure to tune in.

Facebook research being presented at CVPR 2018:

3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
Benjamin Graham, Martin Engelcke, Laurens van der Maaten

Convolutional networks are the de-facto standard for analyzing spatio-temporal data such as images, videos, and 3D shapes. Whilst some of this data is naturally dense (e.g., photos), many other data sources are inherently sparse. Examples include 3D point clouds that were obtained using a LiDAR scanner or RGB-D camera. Standard “dense” implementations of convolutional networks are very inefficient when applied on such sparse data. We introduce new sparse convolutional operations that are designed to process spatially-sparse data more efficiently, and use them to develop spatially-sparse convolutional networks. We demonstrate the strong performance of the resulting models, called submanifold sparse convolutional networks (SSCNs), on two tasks involving semantic segmentation of 3D point clouds. In particular, our models outperform all prior state-of-the-art on the test set of a recent semantic segmentation competition.

A Closer Look at Spatiotemporal Convolutions for Action Recognition
Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri

In this paper we introduce several new forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained frustratingly solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new form of spatiotemporal convolutional block, 2.5D, which gives rise to CNNs that outperform by a large margin the state-of-the-art on the Sports-1M and Kinetics datasets.
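
To make the idea of factorizing a 3D filter concrete, here is a minimal sketch that only counts weights. It assumes a full t x d x d convolution is replaced by a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution. The channel counts, the intermediate width `c_mid`, and the function names are hypothetical choices for illustration and are not taken from the paper.

```python
def conv3d_params(c_in, c_out, t, d):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return c_in * c_out * t * d * d


def factorized_params(c_in, c_mid, c_out, t, d):
    """Weights in a 1 x d x d spatial convolution followed by a
    t x 1 x 1 temporal convolution, with c_mid channels in between."""
    spatial = c_in * c_mid * d * d
    temporal = c_mid * c_out * t
    return spatial + temporal


# Hypothetical layer sizes, chosen only to illustrate the arithmetic.
full = conv3d_params(c_in=64, c_out=64, t=3, d=3)
split = factorized_params(c_in=64, c_mid=64, c_out=64, t=3, d=3)
print(full, split)
```

The intermediate width is a free design choice in this sketch. A different `c_mid` changes the weight count of the factorized block, so the comparison above says nothing by itself about accuracy.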

A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts
Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, Ahmed Elgammal

Most existing zero-shot learning methods consider the problem as a visual semantic embedding one. Given the demonstrated capability of Generative Adversarial Networks (GANs) to generate images, we instead leverage GANs to imagine unseen categories from text descriptions and hence recognize novel classes with no examples being seen. Specifically, we propose a simple yet effective generative model that takes as input noisy text descriptions about an unseen class (e.g. Wikipedia articles) and generates synthesized visual features for this class. With added pseudo data, zero-shot learning is naturally converted to a traditional classification problem. Additionally, to preserve the inter-class discrimination of the generated features, a visual pivot regularization is proposed as an explicit supervision.

A Holistic Framework for Addressing the World using Machine Learning
Ilke Demir, Forest Hughes, Aman Raj, Kaunil Dhruv, Suryanarayana Murthy Muddala, Sanyam Garg, Barrett Doo, Ramesh Raskar

Millions of people are disconnected from basic services due to lack of adequate addressing. We propose an automatic generative algorithm to create street addresses from satellite imagery. Our addressing scheme is coherent with the street topology, linear and hierarchical to follow human perception, and universal to be used as a unified geocoding system. Our algorithm starts with extracting road segments using deep learning and partitions the road network into regions. Then regions, streets, and address cells are named using proximity computations. We also extend our addressing scheme to cover inaccessible areas, to be flexible for changes, and to lead as a pioneer for a unified geodatabase.

A Two-Step Disentanglement Method
Naama Hadad, Lior Wolf, Moni Shahar

We address the problem of disentanglement of factors that generate a given data into those that are correlated with the labeling and those that are not. Our solution is simpler than previous solutions and employs adversarial training. First, the part of the data that is correlated with the labels is extracted by training a classifier. Then, the other part is extracted such that it enables the reconstruction of the original data but does not contain label information. The utility of the new method is demonstrated on visual datasets as well as on financial data.

Audio to Body Dynamics
Eli Shlizerman, Lucio Dery, Hayden Schoen, Ira Kemelmacher-Shlizerman

We present a method that gets as input an audio of violin or piano playing, and outputs a video of skeleton predictions which are further used to animate an avatar. The key idea is to create an animation of an avatar that moves their hands similarly to how a pianist or violinist would do, just from audio. Aiming for fully detailed and correct arm and finger motions is the ultimate goal; however, it’s not clear if body movement can be predicted from music at all. In this paper, we present the first result that shows that natural body dynamics can be predicted. We built an LSTM network that is trained on violin and piano recital videos uploaded to the internet. The predicted points are applied onto a rigged avatar to create the animation.

CondenseNet: An Efficient DenseNet Using Learned Group Convolutions
Gao Huang, Shichen Liu, Laurens van der Maaten, Kilian Q. Weinberger

Deep neural networks are increasingly used on mobile devices, where computational resources are limited. In this paper we develop CondenseNet, a novel network architecture with unprecedented efficiency. It combines dense connectivity with a novel module called learned group convolution. The dense connectivity facilitates feature re-use in the network, whereas learned group convolutions remove connections between layers for which this feature re-use is superfluous. At test time, our model can be implemented using standard group convolutions, allowing for efficient computation in practice. Our experiments show that CondenseNets are far more efficient than state-of-the-art compact convolutional networks such as ShuffleNets.

Data Distillation: Towards Omni-Supervised Learning
Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, Kaiming He

We propose omni-supervised learning, a special regime of semi-supervised learning in which the learner exploits all available labeled data plus internet-scale sources of unlabeled data. Omni-supervised learning is lower-bounded by performance on existing labeled datasets, offering the potential to surpass state-of-the-art fully supervised methods. To exploit the omni-supervised setting, we propose data distillation, a method that ensembles predictions from multiple transformations of unlabeled data, using a single model, to automatically generate new training annotations. We argue that visual recognition models have recently become accurate enough that it is now possible to apply classic ideas about self-training to challenging real-world data. Our experimental results show that in the cases of human keypoint detection and general object detection, state-of-the-art models trained with data distillation surpass the performance of using labeled data from the COCO dataset alone.
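
The data distillation step described above (ensembling one model's predictions over several transformations of an unlabeled input, then keeping the result as a new annotation) can be sketched on toy data. Everything here is an assumption made for illustration: `toy_model` is a hypothetical stand-in for a trained network, the input is a 1D signal rather than an image, the only transformations are identity and flip, and the confidence threshold is arbitrary.

```python
def toy_model(signal):
    """Hypothetical stand-in for a trained model: returns one score per
    position (here, a smoothed copy of the input)."""
    n = len(signal)
    scores = []
    for i in range(n):
        left = signal[i - 1] if i > 0 else signal[i]
        right = signal[i + 1] if i < n - 1 else signal[i]
        scores.append((left + 2 * signal[i] + right) / 4)
    return scores


def flip(values):
    """Reverse a sequence; flipping twice restores the original order."""
    return values[::-1]


# Each entry pairs a transformation with the inverse that maps the
# prediction back into the original coordinate frame.
TRANSFORMS = [
    (lambda x: x, lambda y: y),  # identity
    (flip, flip),                # flip the input, un-flip the prediction
]


def distill_label(signal, threshold=0.5):
    """Average one model's predictions over all transformations and keep
    the top-scoring position as a label only if it is confident enough."""
    aligned = [inverse(toy_model(forward(signal)))
               for forward, inverse in TRANSFORMS]
    mean = [sum(column) / len(aligned) for column in zip(*aligned)]
    best = max(range(len(mean)), key=mean.__getitem__)
    return best if mean[best] >= threshold else None


unlabeled = [0.0, 0.1, 0.9, 0.2, 0.0]
print(distill_label(unlabeled))
```

Inputs whose averaged prediction stays below the threshold return `None` and would simply be left out of the generated training set in this sketch.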

Deep Spatio-Temporal Random Fields for Efficient Video Segmentation
Siddhartha Chandra, Camille Couprie, Iasonas Kokkinos

In this work we introduce a time- and memory-efficient method for structured prediction that couples neuron decisions across both space and time. We show that we are able to perform exact and efficient inference on a densely-connected spatio-temporal graph by capitalizing on recent advances on deep Gaussian random fields. We experiment with multiple connectivity patterns in the temporal domain, and present empirical improvements over strong baselines on the tasks of both semantic and instance segmentation of videos. Our proposed approach is (a) efficient, (b) has a unique global minimum, and (c) can be trained end-to-end alongside contemporary deep networks for video understanding. Our implementation is based on the Caffe2 framework and will be made publicly available.

DeepMVS: Learning Multi-View Stereopsis
Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, Jia-Bin Huang

This method uses deep learning to compute dense depth maps for a collection of images with known camera poses.

DensePose: Dense Human Pose Estimation in the Wild
Rıza Alp Güler, Natalia Neverova, Iasonas Kokkinos

In this work we establish dense correspondences between an RGB image and a surface-based representation of the human body, a task we refer to as dense human pose estimation. We gather dense correspondences for 50K persons appearing in the COCO dataset by introducing an efficient annotation pipeline. We then use our dataset to train CNN-based systems that deliver dense correspondence ‘in the wild’, namely in the presence of background, occlusions and scale variations. We improve our training set’s effectiveness by training an inpainting network that can fill in missing ground truth values and report improvements with respect to the best results that would be achievable in the past. We experiment with fully convolutional networks and region-based models and observe a superiority of the latter. We further improve accuracy through cascading, obtaining a system that delivers highly accurate results at multiple frames per second on a single GPU. Supplementary materials, data, code and videos are provided on the project page.

Detail-Preserving Pooling in Deep Networks
Faraz Saeedan, Nicolas Weber, Michael Goesele, Stefan Roth

Pooling layers are standard layers in deep neural networks that serve to collect and condense information. We propose in this paper a generic novel pooling layer whose parameters can be learned. This is inspired by an earlier work on image downsampling, published by my research group at TU Darmstadt. This new pooling layer yields a moderate but consistent improvement for a wide range of networks.

Detect-and-Track: Efficient Pose Estimation in Videos
Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, Du Tran

We propose a simple, efficient and effective approach to track multi-body human keypoints in videos. Our proposed method builds upon the state-of-the-art in single-frame pose estimation (Mask R-CNN), and adds a light-weight tracking module on top of the frame-level predictions to generate keypoint predictions linked in time. Compared to previous methods, our approach is significantly easier to implement and highly scalable. We conduct extensive ablative experiments on the newly released multi-person video pose estimation benchmark, PoseTrack, to validate various design choices of our model. Our model achieves an accuracy of 55% on the validation and 51.8% on the test set using the Multi-Object Tracking Accuracy (MOTA) metric, and achieves state-of-the-art performance on the ICCV 2017 PoseTrack keypoint tracking challenge.
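
For readers unfamiliar with the MOTA metric mentioned above, here is a minimal sketch of its standard definition from the multi-object tracking literature: one minus the ratio of all errors (misses, false positives, identity switches) to the number of ground-truth objects. The function name and the counts in the example are hypothetical and are not results from the paper; benchmark-specific details of how PoseTrack applies the metric to keypoints are not shown.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multi-Object Tracking Accuracy, with all counts summed over frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Hypothetical counts chosen only to illustrate the arithmetic.
score = mota(false_negatives=20, false_positives=15, id_switches=10,
             num_ground_truth=100)
print(f"{score:.2%}")
```

Because errors can outnumber ground-truth objects, the score can be negative, which is why it is not a percentage of correct detections in the usual sense.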

Detecting and Recognizing Human-Object Interactions
Georgia Gkioxari, Ross Girshick, Piotr Dollár, Kaiming He

To understand the visual world, a machine must not only recognize individual object instances but also how they interact. Humans are often at the center of such interactions and detecting human-object interactions is an important practical and scientific problem. In this paper, we address the task of detecting <human, verb, object> triplets in challenging everyday photos. We propose a novel model that is driven by a human-centric approach. Our hypothesis is that the appearance of a person—their pose, clothing, action—is a powerful cue for localizing the objects they are interacting with. To exploit this cue, our model learns to predict an action-specific density over target object locations based on the appearance of a detected person. Our model also jointly learns to detect people and objects, and by fusing these predictions it efficiently infers interaction triplets in a clean, jointly trained end-to-end system we call InteractNet. We validate our approach on the recently introduced Verbs in COCO (V-COCO) and HICO-DET datasets, where we show quantitatively compelling results.

Don’t Just Assume, Look and Answer: Overcoming Priors for Visual Question Answering
Aishwarya Agrawal, Dhruv Batra, Devi Parikh, Aniruddha Kembhavi

A number of studies have found that today’s Visual Question Answering (VQA) models are heavily driven by superficial correlations in the training data and lack sufficient image grounding. To encourage development of models geared towards the latter, we propose a new setting for VQA where for every question type, train and test sets have different prior distributions of answers. Specifically, we present new splits of the VQA v1 and VQA v2 datasets, which we call Visual Question Answering under Changing Priors (VQA-CP v1 and VQA-CP v2 respectively). First, we evaluate several existing VQA models under this new setting and show that their performance degrades significantly compared to the original VQA setting. Second, we propose a novel Grounded Visual Question Answering model (GVQA) that contains inductive biases and restrictions in the architecture specifically designed to prevent the model from ‘cheating’ by primarily relying on priors in the training data. Specifically, GVQA explicitly disentangles the recognition of visual concepts present in the image from the identification of plausible answer space for a given question, enabling the model to more robustly generalize across different distributions of answers. GVQA is built off an existing VQA model—Stacked Attention Networks (SAN). Our experiments demonstrate that GVQA significantly outperforms SAN on both VQA-CP v1 and VQA-CP v2 datasets. Interestingly, it also outperforms more powerful VQA models such as Multimodal Compact Bilinear Pooling (MCB) in several cases. GVQA offers strengths complementary to SAN when trained and evaluated on the original VQA v1 and VQA v2 datasets. Finally, GVQA is more transparent and interpretable than existing VQA models.

Embodied Question Answering
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, Dhruv Batra

We present a new AI task—Embodied Question Answering (EmbodiedQA)—where an agent is spawned at a random location in a 3D environment and asked a natural language question (“What color is the car?”). In order to answer, the agent must first intelligently navigate to explore the environment, gather information through first-person (egocentric) vision, and then answer the question (“orange”).

Eye In-Painting with Exemplar Generative Adversarial Networks
Brian Dolhansky, Cristian Canton Ferrer

This paper introduces a novel approach to in-painting where the identity of the object to remove or change is preserved and accounted for at inference time: Exemplar GANs (ExGANs). ExGANs are a type of conditional GAN that utilizes exemplar information to produce high-quality, personalized in-painting results. We propose using exemplar information in the form of a reference image of the region to in-paint, or a perceptual code describing that object. Unlike previous conditional GAN formulations, this extra information can be inserted at multiple points within the adversarial network, thus increasing its descriptive power. We show that ExGANs can produce photo-realistic personalized in-painting results that are both perceptually and semantically plausible by applying them to the task of closed-to-open eye in-painting in natural pictures. A new benchmark dataset is also introduced for the task of eye in-painting for future comparisons.

Improving Landmark Localization with Semi-Supervised Learning
Sina Honari, Pavlo Molchanov, Stephen Tyree, Pascal Vincent, Christopher Pal, Jan Kautz

We improve the precision of automated localization of finger joint positions and facial landmark positions (eyes, eyebrows, mouth contour…). The improvement is achieved by leveraging extra images that lack precisely annotated positions but have related labels (hand gesture type, head pose, or facial expression emotion label).

LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers
Lorenzo Baraldi, Matthijs Douze, Rita Cucchiara, Hervé Jégou

This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accuracy and the recognition rate.

Learning by Asking Questions
Ishan Misra, Ross Girshick, Rob Fergus, Martial Hebert, Abhinav Gupta, Laurens van der Maaten

In this paper, we introduce an interactive learning setting for the development and testing of intelligent visual systems, called learning-by-asking (LBA). LBA is evaluated in the same way as the Visual Question Answering (VQA) task. LBA differs from VQA in that most questions are not observed during training time, and the learner must ask questions it wants answers to. Thus, LBA more closely mimics natural learning and has the potential to be more data-efficient than the traditional VQA setting. We present a model that performs LBA on the CLEVR dataset, and show that it automatically discovers an easy-to-hard curriculum when learning interactively from an oracle. Our LBA generated data consistently matches or outperforms the CLEVR train data and is more sample efficient. We also show that our model asks questions that generalize to state-of-the-art VQA models and to novel test time distributions.

Learning Patch Reconstructability for Accelerating Multi-View Stereo
Alex Poms, Chenglei Wu, Shoou-I Yu, Yaser Sheikh

We make the process of performing Multi-View Stereo (3D reconstruction) from a large set of images significantly faster. The key idea is to very quickly figure out what portions of the image are not useful, so that we can just compute what we need and complete 3D reconstruction quicker. We propose to utilize deep learning to predict which portions of the images are not useful (i.e. low patch reconstructability).

Learning to Segment Everything
Ronghang Hu, Piotr Dollár, Kaiming He, Trevor Darrell, Ross Girshick

Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ∼100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories for which all have box annotations, but only a small fraction have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We carefully evaluate our proposed approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.

Link and Code: Fast Indexing with Graphs and Compact Regression Codes
Matthijs Douze, Alexandre Sablayrolles, Hervé Jégou

Similarity search approaches based on graph walks have recently attained outstanding speed-accuracy trade-offs, setting aside the memory constraints. In this paper, we revisit these approaches by considering, additionally, the memory constraint required to index billions of images on a single server. This leads us to propose a method based on both graph traversal and compact representations. We encode the indexed vectors using quantization and exploit the graph structure to refine the similarity estimation. In essence, our method takes the best of these two different worlds: the search strategy is based on nested graphs, thereby providing high precision with a relatively small set of comparisons. At the same time it offers a significant memory compression. As a result, our approach outperforms the state of the art on operating points considering 64-128 bytes per vector, as demonstrated by our results on two billion-scale public benchmarks.

Low-Shot Learning from Imaginary Data
Yu-Xiong Wang, Ross Girshick, Martial Hebert, Bharath Hariharan

Humans can quickly learn new visual concepts, perhaps because they can easily visualize or imagine what novel objects look like from different views. Incorporating this ability to hallucinate novel instances of new concepts might help machine vision systems perform better low-shot learning, i.e., learning concepts from few examples. We present a novel approach to low-shot learning that uses this idea. Our approach builds on recent progress in meta-learning (“learning to learn”) by combining a meta-learner with a “hallucinator” that produces additional training examples, and optimizing both models jointly. Our hallucinator can be incorporated into a variety of meta-learners and provides significant gains: up to a 6-point boost in classification accuracy when only a single training example is available, yielding state-of-the-art performance on the challenging ImageNet low-shot classification benchmark.

Low-Shot Learning with Large-Scale Diffusion
Matthijs Douze, Arthur Szlam, Bharath Hariharan, Hervé Jégou

This paper considers the problem of inferring image labels from images when only a few labelled examples are available at training time. This setup is often referred to as low-shot learning in the literature, where a standard approach is to re-train the last few layers of a convolutional neural network learned on separate classes. We consider a semi-supervised setting in which we exploit a large collection of images to support label propagation. This is made possible by leveraging the recent advances on large-scale similarity graph construction. We show that despite its conceptual simplicity, scaling up label propagation to hundreds of millions of images leads to state-of-the-art accuracy in the low-shot learning regime.

Modeling Facial Geometry Using Compositional VAEs
Timur Bagautdinov, Chenglei Wu, Jason Saragih, Pascal Fua, Yaser Sheikh

We propose a method for learning non-linear face geometry representations using deep generative models. Our model is a variational autoencoder with multiple levels of hidden variables where lower layers capture global geometry and higher ones encode more local deformations. Based on that, we propose a new parameterization of facial geometry that naturally decomposes the structure of the human face into a set of semantically meaningful levels of detail. This parameterization enables us to do model fitting while capturing varying level of detail under different types of geometrical constraints.

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

Deep models are the de facto standard in visual decision problems due to their impressive performance on a wide array of visual tasks. On the other hand, their opaqueness has led to a surge of interest in explainable systems. In this work, we emphasize the importance of model explanation in various forms such as visual pointing and textual justification. The lack of data with justification annotations is one of the bottlenecks of generating multimodal explanations. Thus, we propose two large-scale datasets with annotations that visually and textually justify a classification decision for various activities, i.e. ACT-X, and for question answering, i.e. VQA-X. We also introduce a multimodal methodology for generating visual and textual explanations simultaneously. We quantitatively show that training with the textual explanations not only yields better textual justification models, but also models that better localize the evidence that supports their decision.

Neural Baby Talk
Jiasen Lu, Jianwei Yang, Dhruv Batra, Devi Parikh

We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities object detectors find in the image. It reconciles the slot filling approaches (known to be better grounded in images) and neural captioning approaches (known to be more natural sounding). Our approach first generates a (neural) sentence “template” with slot locations explicitly tied to image regions. These slots are then filled in by concepts object detectors found in the associated regions. Our approach can resolve the visual co-reference from weak supervision and generate novel captions using novel concepts found by object detectors. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both the MS-COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions (and hence language priors of associated captions) are different. Code and the proposed split have been made available here.

Non-Local Neural Networks
Xiaolong Wang, Ross Girshick, Abhinav Gupta, Kaiming He

Both convolutional and recurrent operations are building blocks that process one local neighborhood at a time. In this paper, we present non-local operations as a generic family of building blocks for capturing long-range dependencies. Inspired by the classical non-local means method [4] in computer vision, our non-local operation computes the response at a position as a weighted sum of the features at all positions. This building block can be plugged into many computer vision architectures. On the task of video classification, even without any bells and whistles, our non-local models can compete with or outperform current competition winners on both Kinetics and Charades datasets. In static image recognition, our non-local models improve object detection/segmentation and pose estimation on the COCO suite of tasks. Code will be made available.
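
The phrase "a weighted sum of the features at all positions" can be illustrated with a minimal sketch. The weighting used here (a softmax over dot-product similarity between raw features) is an assumption made for illustration. The abstract does not specify the pairwise function, and this sketch omits any learned embeddings or surrounding block structure a real network would have. The function name `non_local` is hypothetical.

```python
import math


def non_local(features):
    """For each position i, return a weighted sum of the features at all
    positions j, weighted by a softmax over dot-product similarity."""
    outputs = []
    for x_i in features:
        sims = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        peak = max(sims)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * x_j[k] for w, x_j in zip(weights, features))
            for k in range(len(x_i))
        ])
    return outputs


# Three positions with 2-dimensional features (toy values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in non_local(feats):
    print([round(v, 3) for v in row])
```

Unlike a convolution, every output position here depends on every input position, regardless of distance. The cost is a number of pairwise comparisons that grows quadratically with the number of positions.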

Separating Self-Expression and Visual Content in Hashtag Supervision
Andreas Veit, Maximilian Nickel, Serge Belongie, Laurens van der Maaten

The variety, abundance, and structured nature of hashtags make them an interesting data source for training vision models. For instance, hashtags have the potential to significantly reduce the problem of manual supervision and annotation when learning vision models for a large number of concepts. However, a key challenge when learning from hashtags is that they are inherently subjective because they are provided by users as a form of self-expression. As a consequence, hashtags may have synonyms (different hashtags referring to the same visual content) and may be polysemous (the same hashtag referring to different visual content). These challenges limit the effectiveness of approaches that simply treat hashtags as image-label pairs. This paper presents an approach that extends upon modeling simple image-label pairs with a joint model of images, hashtags, and users. We demonstrate the efficacy of such approaches in image tagging and retrieval experiments, and show how the joint model can be used to perform user-conditional retrieval and tagging.

Stacked Latent Attention for Multimodal Reasoning
Haoqi Fan, Jiatong Zhou

Attention has been shown to be a pivotal development in deep learning and has been used for a multitude of multimodal learning tasks such as visual question answering and image captioning. In this work, we pinpoint the potential limitations to the design of a traditional attention model. We identify that 1) current attention mechanisms discard the information from intermediate reasoning, losing the positional information already captured by the attention heatmaps and 2) stacked attention, a common way to improve spatial reasoning, may have suboptimal performance because of the vanishing gradient problem. We introduce a novel attention architecture to address these problems, in which all spatial configuration information contained in the intermediate reasoning process is retained in a pathway of convolutional layers. We show that this new attention leads to substantial improvements in multiple multimodal reasoning tasks including achieving the best single model performance without using external knowledge on the VQA dataset as well as clear gains for the image captioning task.

Supervision-by-Registration: An Unsupervised Approach to Improve the Precision of Facial Landmark Detectors
Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, Yaser Sheikh

We propose a method to find facial landmarks (e.g. corner of eyes, corner of mouth, tip of nose, etc.) more precisely. Our method utilizes the fact that objects move smoothly in a video sequence (i.e. optical flow registration) to improve an existing facial landmark detector. The key novelty is that no additional human annotations are necessary to improve the detector, hence it is an “unsupervised approach.”

Total Capture: A 3D Deformation Model for Tracking Faces, Hands, and Bodies
Hanbyul Joo, Tomas Simon, Yaser Sheikh

We present a unified 3D model for the markerless capture of human movement, including facial expressions, body motion, and hand gestures. An initial model is generated by locally stitching together models of the individual parts of the human body, which we refer to as the “Frankenstein” model. Using a large-scale capture of people wearing everyday clothes, we optimize the Frankenstein model to create “Adam.” Adam is a calibrated model that shares the same skeleton hierarchy as the initial model but can express hair and clothing geometry, making it directly usable for fitting people as they normally appear in everyday life. Finally, we demonstrate the use of these models for total motion tracking, simultaneously capturing the large-scale body movements and the subtle face and hand motion of a social group of people.

Unsupervised Correlation Analysis
Yedid Hoshen, Lior Wolf

Humans are able to make analogies across domains even without supervision. Significant progress has recently been made on learning analogies between multiple image domains. In this paper, we extend the scope of unsupervised analogies by tackling the problem of learning analogies between different modalities without supervision, namely between text and images. The state-of-the-art technique for this task is Canonical Correlation Analysis (CCA). It is however a supervised approach and is not suitable for this task. This task is currently unsolved by unsupervised methods. We therefore introduce a new algorithm, Unsupervised Correlation Analysis (UCA). UCA is able to learn a joint representation between text and images without any supervised matches. We show experimentally that UCA is able to find analogies between text and images without supervision, as well as analogies between very different image domains.

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets
De-An Huang, Vignesh Ramanathan, Dhruv Mahajan, Lorenzo Torresani, Manohar Paluri, Li Fei-Fei, Juan Carlos Niebles

The ability to capture temporal information stimulates the development of video understanding models. While various models have aimed to address the challenge of temporal modeling and have shown empirical improvements, a detailed understanding and explicit analysis of the effect of temporal information on video understanding is still missing. In this work, we aim to bridge this gap and ask the following question: How important is the temporal information in the video for recognizing the action? To this end, we propose two novel frameworks: (i) motion prior generator and (ii) order-invariant frame selector to recover information from the video while maintaining a constant amount of temporal information from the video. This isolates the analysis of temporal information from others. The proposed frameworks allow us to significantly reduce the upper bound for the effect of temporal information (from 25% to 6% on UCF101 and 15% to 5% on Kinetics) in our analysis. More interestingly, we show that without using any motion information from the video we aim to classify, the upper bound by an oracle order-invariant frame selector can actually outperform the original video model.

Other activities at CVPR 2018


DeepGlobe Workshop: A Challenge to Parse the Earth through Satellite Images
Ilke Demir and Manohar Paluri, organizers
Guan Pang, Jing Huang, Saikat Basu and Forest Hughes, technical team

Paper: DeepGlobe 2018: A Challenge to Parse the Earth through Satellite Images
Ilke Demir, Guan Pang, Jing Huang and Saikat Basu

Deep Vision workshop
Dhruv Batra and Yann LeCun, organizers

Efficient Deep Learning for Computer Vision workshop
Peter Vajda and Fernando De la Torre, organizers

The VQA Challenge and Visual Dialog workshop
Dhruv Batra and Devi Parikh, organizers

Women in Computer Vision Workshop
Adriana Romero Soriano and Ilke Demir, organizers
Jessica Hodgins and Jitendra Malik, speakers/panelists

Workshop on Autonomous Driving
Paper: On the iterative refinement of densely connected representation levels for semantic segmentation
Adriana Romero Soriano and Michal Drozdzal

Tutorials and Panels

How to be a good citizen of the CVPR community panel
Devi Parikh and Dhruv Batra, organizers
Georgia Gkioxari, Jitendra Malik and Kristen Grauman, speakers

Interpretable Machine Learning for Computer Vision tutorial
Laurens van der Maaten, speaker

Visual Recognition and Beyond tutorial
Ross Girshick, Kaiming He and Georgia Gkioxari, speakers