
Dual attention networks for visual reference resolution in visual dialog

Abstract

© 2019 Association for Computational Linguistics

Visual dialog (VisDial) is a task that requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), the series of questions should capture temporal context from the dialog history and utilize visually grounded information. Visual reference resolution addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to locate those references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. DAN consists of two attention modules, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a multi-head attention mechanism. The FIND module takes image features and reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.
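The abstract describes a two-stage attention pipeline: REFER relates the current question to the dialog history with multi-head attention, and FIND grounds the resulting reference-aware representation in image region features. Below is a minimal, illustrative PyTorch sketch of that structure; all dimensions, module names, and fusion details (the residual connection in REFER and the element-wise product in FIND) are assumptions for illustration, not the authors' reference implementation.

```python
# Minimal sketch of a REFER/FIND-style pipeline, assuming sentence-level
# question/history embeddings and bottom-up region features as inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferModule(nn.Module):
    """Relates the question to the dialog history with multi-head attention."""

    def __init__(self, dim=512, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question, history):
        # question: (B, 1, D) embedding; history: (B, T, D) per-round embeddings
        attended, _ = self.attn(query=question, key=history, value=history)
        # Residual fusion -> reference-aware question representation (B, 1, D)
        return self.norm(question + attended)


class FindModule(nn.Module):
    """Grounds the reference-aware representation in image region features."""

    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, ref_aware, regions):
        # ref_aware: (B, 1, D); regions: (B, R, D) bottom-up region features
        fused = torch.tanh(self.proj(regions) * ref_aware)   # (B, R, D)
        weights = F.softmax(self.score(fused), dim=1)         # (B, R, 1)
        visual = (weights * regions).sum(dim=1)               # (B, D) attended visual feature
        return visual, weights.squeeze(-1)


if __name__ == "__main__":
    B, T, R, D = 2, 10, 36, 512
    refer, find = ReferModule(D), FindModule(D)
    q = torch.randn(B, 1, D)   # current question embedding
    h = torch.randn(B, T, D)   # dialog-history embeddings
    v = torch.randn(B, R, D)   # image region features
    visual_feat, attn = find(refer(q, h), v)
    print(visual_feat.shape, attn.shape)  # torch.Size([2, 512]) torch.Size([2, 36])
```

In this sketch the attended visual feature and region weights would feed a downstream answer decoder; the actual fusion and decoding choices in DAN may differ.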


This paper was published in SNU Open Repository and Archive.
