TextFusion: Unveiling the power of textual semantics for controllable image fusion
Information Fusion (IF 14.7), Pub Date: 2024-11-19, DOI: 10.1016/j.inffus.2024.102790
Chunyang Cheng, Tianyang Xu, Xiao-Jun Wu, Hui Li, Xi Li, Zhangyong Tang, Josef Kittler

Advanced image fusion techniques aim to synthesise fusion results by integrating the complementary information provided by the source inputs. However, the inherent differences in how distinct modalities capture and represent the same scene pose significant challenges for designing a robust and controllable fusion process. We argue that incorporating high-level semantic information from the text modality can mitigate this issue, enabling the generation of text-conditional fused content (TextFusion) for enhanced visualisation or for supporting downstream tasks in a more controllable manner. To achieve this, we employ a vision-and-language model to establish a coarse-to-fine association mechanism. Leveraging the association maps, an affine fusion unit is introduced to seamlessly combine the text and visual modalities at the feature level. Additionally, we introduce a textual attention mechanism designed to refine conventional average-based image fusion evaluation metrics, which often fail to accurately capture the true quality of fusion outcomes. To encourage the wider adoption of our controllable image fusion framework, we release a publicly available text-annotated image fusion dataset, IVT. Extensive experiments demonstrate that our approach consistently outperforms traditional appearance-based fusion methods in both subjective and objective evaluations. Our code and dataset are publicly available at https://github.com/AWCXV/TextFusion.
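The abstract describes the affine fusion unit only at a high level. A minimal PyTorch sketch of one plausible reading follows, assuming a FiLM-style design in which per-pixel scale and shift parameters are predicted from the text-region association map and applied to the visual features; the class and layer names (AffineFusionUnit, to_gamma, to_beta) are hypothetical and do not reflect the authors' released code.

```python
import torch
import torch.nn as nn

class AffineFusionUnit(nn.Module):
    """Illustrative sketch (assumption, not the paper's implementation):
    a text-derived association map modulates both modalities' features
    with per-pixel scale (gamma) and shift (beta), FiLM-style, before
    the modulated features are merged."""

    def __init__(self, vis_channels: int, assoc_channels: int):
        super().__init__()
        # 1x1 convs predict the affine parameters from the association map.
        self.to_gamma = nn.Conv2d(assoc_channels, vis_channels, kernel_size=1)
        self.to_beta = nn.Conv2d(assoc_channels, vis_channels, kernel_size=1)
        self.merge = nn.Conv2d(2 * vis_channels, vis_channels,
                               kernel_size=3, padding=1)

    def forward(self, f_ir, f_vis, assoc):
        # assoc: (B, C_a, H, W) coarse-to-fine text-region association map.
        gamma = self.to_gamma(assoc)
        beta = self.to_beta(assoc)
        # Text-conditioned affine modulation of each modality's features.
        f_ir_mod = gamma * f_ir + beta
        f_vis_mod = gamma * f_vis + beta
        return self.merge(torch.cat([f_ir_mod, f_vis_mod], dim=1))

# Usage on dummy feature maps:
unit = AffineFusionUnit(vis_channels=64, assoc_channels=1)
f_ir = torch.randn(2, 64, 32, 32)
f_vis = torch.randn(2, 64, 32, 32)
assoc = torch.rand(2, 1, 32, 32)   # e.g. upsampled text-relevance scores
fused = unit(f_ir, f_vis, assoc)   # -> (2, 64, 32, 32)
```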
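Similarly, the textual attention refinement of average-based metrics can be read as replacing the uniform spatial mean with an attention-weighted mean, so that regions named in the text description dominate the score. A minimal NumPy sketch under that assumption; attention_weighted_score is a hypothetical helper, not the paper's actual metric.

```python
import numpy as np

def attention_weighted_score(quality_map: np.ndarray,
                             text_attention: np.ndarray) -> float:
    """Illustrative sketch (assumption): weight a per-pixel quality map
    by a normalised textual attention map instead of averaging uniformly."""
    w = text_attention / (text_attention.sum() + 1e-8)  # normalise to a distribution
    return float((quality_map * w).sum())

# Dummy example: a per-pixel quality map and an attention map
# highlighting a region matched to the text query.
rng = np.random.default_rng(0)
quality = rng.random((128, 128))
attention = np.zeros((128, 128))
attention[40:80, 40:80] = 1.0

print(attention_weighted_score(quality, attention))  # text-weighted score
print(float(quality.mean()))                         # conventional uniform average
```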

Updated: 2024-11-19