Situated and Interactive Multimodal Conversations

International Conference on Computational Linguistics (COLING)


Next-generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, and the user’s utterances) and to perform multimodal actions (e.g., displaying a route while generating the system’s utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ∼13K human-human dialogs (∼169K utterances) collected using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture, grounded in a shared virtual environment; and (b) fashion, grounded in an evolving set of images. The datasets include the multimodal context of the items appearing in each scene, along with contextual NLU, NLG, and coreference annotations that use a novel, unified framework of SIMMC conversational acts for both user and assistant utterances.

Finally, we present several tasks within SIMMC as objective evaluation protocols, such as structural API prediction, response generation, and dialog state tracking. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, and models are publicly available.

