Revisiting Visual Question Answering Baselines

European Conference on Computer Vision 2016


Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.

Related Publications

All Publications

A Scalable Approach to Control Diverse Behaviors for Physically Simulated Characters

Jungdam Won, Deepak Gopinath, Jessica Hodgins

ACM SIGGRAPH - July 19, 2020

ARCH: Animatable Reconstruction of Clothed Humans

Zeng Huang, Yuanlu Xu, Christoph Lassner, Hao Li, Tony Tung

CVPR - June 15, 2020

In Defense of Grid Features for Visual Question Answering

Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

CVPR - June 14, 2020

Hierarchical Scene Coordinate Classification and Regression for Visual Localization

Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, Juho Kannala

CVPR - June 13, 2020

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy