Efficient Modeling of Future Context for Image Captioning

Fei, Zhengcong; Huang, Junshi; Wei, Xiaoming; Wei, Xiaolin

arXiv:2207.10897 (cs)

[Submitted on 22 Jul 2022 (v1), last revised 18 Oct 2022 (this version, v2)]

Abstract: Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context, namely the given image and the previously generated words. Many studies have aimed to exploit global information during decoding, e.g., through iterative refinement. However, how to incorporate future context effectively and efficiently remains under-explored. To address this issue, and inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advantage onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, which follows a teacher-student paradigm and is optimized with the distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses the state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: this https URL.
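The abstract does not spell out the mask operation or the training objective, so the following is only a minimal sketch of the two general ideas it mentions, under stated assumptions. It contrasts a causal (left-to-right) attention mask with a two-sided one, and shows a generic confidence-gated teacher-student loss in which a KL term is applied only at positions where the student's top probability is low. The function names, the threshold `tau`, and the use of plain KL divergence are all hypothetical illustrations, not the paper's distribution calibration objective.

```python
import math


def attention_mask(length, causal):
    """Return a length x length 0/1 matrix.

    Entry [i][j] == 1 means position i may attend to position j.
    causal=True  -> only j <= i is visible (autoregressive style).
    causal=False -> every position is visible (two-sided context).
    """
    return [[1 if (not causal or j <= i) else 0 for j in range(length)]
            for i in range(length)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def gated_distillation_loss(student_probs, teacher_probs, tau=0.5):
    """Average KL(teacher || student) over the student's unconfident positions.

    A position counts as unconfident when the student's highest word
    probability is below tau (an assumed gating rule, chosen for illustration).
    Returns 0.0 when no position is gated in.
    """
    losses = [kl_divergence(t, s)
              for s, t in zip(student_probs, teacher_probs)
              if max(s) < tau]
    return sum(losses) / len(losses) if losses else 0.0


if __name__ == "__main__":
    # Toy per-position distributions over a 3-word vocabulary (made-up numbers).
    student = [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25]]
    teacher = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
    print(attention_mask(3, causal=True))
    print(gated_distillation_loss(student, teacher))
```

In this toy setup only the second position contributes to the loss, because the student is already confident at the first. At inference time such a student would still decode left to right with the causal mask, which is consistent with the abstract's claim of no extra time cost.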

Comments:	ACM Multimedia 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2207.10897 [cs.CV]
	(or arXiv:2207.10897v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2207.10897

Submission history

From: Zhengcong Fei [view email]
[v1] Fri, 22 Jul 2022 06:21:43 UTC (6,281 KB)
[v2] Tue, 18 Oct 2022 05:57:50 UTC (1,783 KB)
