International Journal of Computer Vision (IF 11.6) · Pub Date: 2024-10-08 · DOI: 10.1007/s11263-024-02220-6 · Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
Recent advances in image captioning have focused on enhancing accuracy by scaling up datasets and model sizes. While conventional captioning models achieve high performance on established metrics such as BLEU, CIDEr, and SPICE, their ability to generate captions that distinguish the target image from other similar images remains under-explored. To generate distinctive captions, a few pioneering works have employed contrastive learning or re-weighted the ground-truth captions. However, these approaches often overlook the relationships among objects in a group of similar images (e.g., items or properties within the same album or fine-grained events). In this paper, we introduce a novel approach to enhance the distinctiveness of image captions, namely the Group-based Differential Distinctive Captioning Method, which visually compares each image with the other images in a similar group and highlights the uniqueness of each image. In particular, we introduce a Group-based Differential Memory Attention (GDMA) module, designed to identify and emphasize object features in an image that are uniquely distinguishable within its image group, i.e., those exhibiting low similarity with objects in other images. This mechanism ensures that such unique object features are prioritized during caption generation, thereby enhancing the distinctiveness of the resulting captions. To further refine this process, we select distinctive words from the ground-truth captions to guide both the language decoder and the GDMA module. Additionally, we propose a new evaluation metric, the Distinctive Word Rate (DisWordRate), to quantitatively assess caption distinctiveness. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models and achieves state-of-the-art performance on distinctiveness without excessively sacrificing accuracy.
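The group-based differential weighting idea can be sketched as follows. This is an illustrative simplification, not the paper's actual GDMA implementation: it assumes each image is represented by a set of object feature vectors, and it up-weights objects whose features have low cosine similarity to every object in the other images of the group.

```python
import numpy as np

def differential_weights(group_feats, tau=1.0):
    """Sketch of group-based differential weighting (illustrative only,
    not the paper's GDMA module).

    group_feats: list of (n_objects_i, d) arrays, one per image in the group.
    Returns one weight vector per image: objects dissimilar to all objects
    in the *other* images receive higher weight, so a decoder could
    prioritize them when generating a distinctive caption.
    """
    # L2-normalize features so dot products are cosine similarities
    norm = [f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
            for f in group_feats]
    weights = []
    for i, fi in enumerate(norm):
        # Pool all object features from the other images in the group
        others = np.concatenate([f for j, f in enumerate(norm) if j != i])
        sim = fi @ others.T                   # (n_i, n_others) cosine sims
        max_sim = sim.max(axis=1)             # closest match elsewhere
        w = np.exp((1.0 - max_sim) / tau)     # low similarity -> high weight
        weights.append(w / w.sum())           # normalize per image
    return weights
```

Here `tau` is a hypothetical temperature controlling how sharply distinctive objects are emphasized; in the paper, the attention weights are learned rather than computed by a fixed rule.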
Moreover, the results of our user study are consistent with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.
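A DisWordRate-style metric could be computed along the following lines. The exact definition is given in the paper; this sketch assumes one plausible reading: the fraction of words in a generated caption that are "distinctive", i.e., appear in the target image's reference captions but in none of the other group images' references.

```python
def dis_word_rate(caption, target_refs, other_refs):
    """Hypothetical sketch of a DisWordRate-style metric (assumed
    definition, not the paper's exact formulation): the fraction of
    caption words found in the target image's reference captions but
    absent from all other group images' reference captions."""
    target_vocab = {w for ref in target_refs for w in ref.lower().split()}
    other_vocab = {w for ref in other_refs for w in ref.lower().split()}
    distinctive = target_vocab - other_vocab
    words = caption.lower().split()
    if not words:
        return 0.0
    return sum(w in distinctive for w in words) / len(words)
```

Under this reading, a caption that only reuses words shared across the whole group scores 0, while one built entirely from target-specific words scores 1.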
Title: Group-Based Distinctive Image Captioning with Memory Difference Encoding and Attention