Publication

Towards VQA Models That Can Read

Conference Computer Vision and Pattern Recognition (CVPR)


Abstract

Studies have shown that a dominant class of questions asked by visually impaired users on images of their surroundings involves reading text in the image. But today’s VQA models can not read! Our paper takes a first step towards addressing this problem. First, we introduce a new “TextVQA” dataset to facilitate progress on this important problem. Existing datasets either have a small proportion of questions about text (e.g., the VQA dataset) or are too small (e.g., the VizWiz dataset). TextVQA contains 45,336 questions on 28,408 images that require reasoning about text to answer. Second, we introduce a novel model architecture that reads text in the image, reasons about it in the context of the image and the question, and predicts an answer which might be a deduction based on the text and the image or is composed of the strings found in the image. Consequently, we call our approach Look, Read, Reason & Answer (LoRRA)1. We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset. We find that the gap between human performance and machine performance is significantly larger on TextVQA than on VQA 2.0, suggesting that TextVQA is well-suited to benchmark progress along directions complementary to VQA 2.0.

https://textvqa.org

See codebase on GitHub here

Related Publications

All Publications

Federated Learning for User Privacy and Data Confidentiality Workshop At ICML - July 24, 2021

Federated Learning with Buffered Asynchronous Aggregation

John Nguyen, Kshitiz Malik, Hongyuan Zhan, Ashkan Yousefpour, Michael Rabbat, Mani Malek, Dzmitry Huba

ISMAR - July 29, 2021

Instant Visual Odometry Initialization for Mobile AR

Alejo Concha, Michael Burri, Jesus Briales, Christian Forster, Luc Oth

UAI - July 28, 2021

A Nonmyopic Approach to Cost-Constrained Bayesian Optimization

Eric Hans Lee, David Eriksson, Valerio Perrone, Matthias Seeger

ACM MM - October 20, 2021

EVRNet: Efficient Video Restoration on Edge Devices

Sachin Mehta, Amit Kumar, Fitsum Reda, Varun Nasery, Vikram Mulukutla, Rakesh Ranjan, Vikas Chandra

To help personalize content, tailor and measure ads, and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls: Cookies Policy