Learning Combinatorial Prompts for Universal Controllable Image Captioning, International Journal of Computer Vision



Learning Combinatorial Prompts for Universal Controllable Image Captioning
International Journal of Computer Vision (IF 11.6), Pub Date: 2024-07-22, DOI: 10.1007/s11263-024-02179-4
Zhen Wang, Jun Xiao, Yueting Zhuang, Fei Gao, Jian Shao, Long Chen

Controllable Image Captioning (CIC), which generates natural language descriptions of images under the guidance of given control signals, is one of the most promising directions toward next-generation captioning systems. To date, various kinds of control signals for CIC have been proposed, ranging from content-related control to structure-related control. However, due to the format and target gaps between different control signals, all existing CIC works (or architectures) focus on only one control signal and overlook the human-like combinatorial ability. By "combinatorial", we mean that humans can easily meet multiple needs (or constraints) simultaneously when generating descriptions. To this end, we propose a novel prompt-based framework for CIC by learning Combinatorial Prompts, dubbed ComPro. Specifically, we directly utilize a pretrained language model, GPT-2 (Radford et al., OpenAI blog 1:9, 2019), as our language model, which helps bridge the gap between different signal-specific CIC architectures. Then, we reformulate CIC as a prompt-guided sentence generation problem and propose a new lightweight prompt generation network to generate the combinatorial prompts for different kinds of control signals. For different control signals, we further design a new mask attention mechanism to realize the prompt-based CIC. Due to its simplicity, ComPro can be further extended to more kinds of combined control signals by concatenating these prompts. Extensive experiments on two prevalent CIC benchmarks have verified the effectiveness and efficiency of ComPro on both single and combined control signals.
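The abstract does not give implementation details, so the following is only a rough sketch of the general idea of combining control signals by concatenating their prompts and masking attention over the result. Everything in it is an assumption made for illustration: the signal names `content` and `structure`, the helper names `concat_prompts` and `build_attention_mask`, the toy prompt vectors, and the masking rule (every position may attend to all prompt positions, caption positions attend causally to earlier caption positions). It is not the paper's prompt generation network or its mask attention design.

```python
from typing import Dict, List

Vector = List[float]


def concat_prompts(prompts_by_signal: Dict[str, List[Vector]],
                   order: List[str]) -> List[Vector]:
    """Concatenate the prompt vectors of the chosen control signals, in order."""
    combined: List[Vector] = []
    for name in order:
        combined.extend(prompts_by_signal[name])
    return combined


def build_attention_mask(num_prompt: int, num_caption: int) -> List[List[int]]:
    """Return a mask where mask[i][j] == 1 means position i may attend to j.

    Assumed sequence layout: [prompt positions | caption positions].
    Assumed rule (illustrative only): all positions see every prompt position;
    caption positions additionally see themselves and earlier caption positions.
    """
    total = num_prompt + num_caption
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_prompt:
                mask[i][j] = 1  # prompts are visible to every position
            elif i >= num_prompt and j <= i:
                mask[i][j] = 1  # causal attention among caption positions
    return mask


# Toy prompts for two hypothetical control signals (values are placeholders).
prompts = {
    "content": [[0.3, 0.4], [0.5, 0.6]],
    "structure": [[0.1, 0.2]],
}
combined = concat_prompts(prompts, ["content", "structure"])
mask = build_attention_mask(num_prompt=len(combined), num_caption=2)
```

In this sketch, adding another control signal only lengthens the prompt prefix, while the caption part of the sequence and the language model itself stay unchanged.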


Updated: 2024-07-22
