Situated and Interactive Multimodal Conversations

International Conference on Computational Linguistics (COLING)


Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, and the user’s utterances) and to perform multimodal actions (e.g., displaying a route while generating the system’s utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ∼13K human-human dialogs (∼169K utterances), collected using a multimodal Wizard-of-Oz (WoZ) setup in two shopping domains: (a) furniture, grounded in a shared virtual environment; and (b) fashion, grounded in an evolving set of images. The datasets include the multimodal context of the items appearing in each scene, together with contextual NLU, NLG, and coreference annotations, using a novel, unified framework of SIMMC conversational acts that covers both user and assistant utterances.
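To make the annotation framework concrete, the following is a minimal, hypothetical sketch of what one annotated dialog turn might look like. The field names, act labels, and object schema below are illustrative assumptions, not the released SIMMC format; the point is how a single object identifier ties the user's coreference, the assistant's act, and the item in the multimodal context together.

```python
# Hypothetical sketch of one annotated SIMMC-style dialog turn.
# Field names and act labels are illustrative assumptions, not the released schema.
turn = {
    "turn_idx": 3,
    "user_utterance": "Do you have that couch in a darker color?",
    "user_act": {                      # contextual NLU annotation
        "act": "REQUEST:GET",          # assumed conversational-act label
        "slots": {"color": "darker"},
        "coref_objects": [12],         # object id resolved against the scene
    },
    "assistant_utterance": "Yes, it also comes in charcoal gray.",
    "assistant_act": {                 # the same act framework annotates the NLG side
        "act": "INFORM:GET",
        "slots": {"color": "charcoal gray"},
        "coref_objects": [12],
    },
    "multimodal_context": [            # items appearing in the current scene
        {"object_id": 12, "type": "couch", "color": "beige"},
        {"object_id": 17, "type": "lamp", "color": "black"},
    ],
}

# The shared object id is what makes the coreference resolvable:
# "that couch" and "it" both ground to scene object 12.
assert turn["user_act"]["coref_objects"] == turn["assistant_act"]["coref_objects"]
```

Annotating user and assistant turns with the same act inventory is what lets a single model treat understanding (NLU) and generation (NLG) under one label space.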

Finally, we present several tasks within SIMMC as objective evaluation protocols, such as structural API prediction, response generation, and dialog state tracking. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, and models are publicly available.
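As a concrete illustration of how the structural API prediction task can be scored, the sketch below computes exact-match accuracy between predicted and reference assistant API calls. The `(action, args)` call representation and the metric are assumptions made for illustration, not the paper's exact evaluation protocol.

```python
# Toy exact-match accuracy for structural API (action) prediction.
# The (action, args) call format here is an assumption for illustration.
def api_accuracy(predictions, references):
    """Fraction of turns where both the predicted action name and its
    argument dictionary match the reference exactly."""
    correct = sum(
        p["action"] == r["action"] and p["args"] == r["args"]
        for p, r in zip(predictions, references)
    )
    return correct / len(references)

# Hypothetical gold and predicted API calls for two dialog turns.
gold = [
    {"action": "SearchFurniture", "args": {"color": "gray", "type": "couch"}},
    {"action": "Rotate", "args": {"object_id": 12, "direction": "left"}},
]
pred = [
    {"action": "SearchFurniture", "args": {"color": "gray", "type": "couch"}},
    {"action": "Rotate", "args": {"object_id": 12, "direction": "right"}},
]

print(api_accuracy(pred, gold))  # 0.5: the second turn's argument differs
```

Exact match is a deliberately strict choice: a call with the right action but a wrong argument (here, rotating in the wrong direction) counts as fully incorrect, which mirrors how a downstream multimodal executor would actually behave.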

