ECCV 2020
Improving Vision-and-Language Navigation with Web Image-Text Pairs
Following a navigation instruction such as ‘Walk down the stairs and stop at the brown sofa’ requires embodied AI agents to ground referenced scene elements (e.g. ‘stairs’) to visual content in the environment (pixels corresponding to ‘stairs’). We ask the following question: can we leverage abundant ‘disembodied’ web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)?