Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-modal Manipulation
International Journal of Computer Vision (IF 11.6) Pub Date: 2024-10-07, DOI: 10.1007/s11263-024-02245-x
Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang

Detecting and grounding multi-modal media manipulation (\(\hbox {DGM}^4\)) has become increasingly crucial due to the widespread dissemination of face forgery and text misinformation. In this paper, we present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the \(\hbox {DGM}^4\) problem. Unlike previous state-of-the-art methods that solely focus on the image (RGB) domain to describe visual forgery features, we additionally introduce the frequency domain as a complementary viewpoint. By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts. Then, our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands. Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations. Finally, based on visual and textual forgery features, we propose a unified decoder that comprises two symmetric cross-modal interaction modules responsible for gathering modality-specific forgery information, along with a fusing interaction module for aggregation of both modalities. The proposed unified decoder formulates our UFAFormer as a unified framework, ultimately simplifying the overall architecture and facilitating the optimization process. Experimental results on the \(\hbox {DGM}^4\) dataset, containing several perturbations, demonstrate the superior performance of our framework compared to previous methods, setting a new benchmark in the field.
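The abstract above outlines the pipeline without implementation detail. As a minimal, hypothetical sketch of only the wavelet-decomposition step, assuming a single-level 2-D DWT with a Haar wavelet and the PyWavelets library (none of these choices are stated in the abstract, and this is not the authors' code), the frequency sub-bands that a frequency encoder would consume could be obtained roughly as follows:

import numpy as np
import pywt  # PyWavelets (assumed dependency, not specified by the paper)

def decompose_into_subbands(image, wavelet="haar"):
    # Single-level 2-D discrete wavelet transform: one low-frequency
    # approximation band and three high-frequency detail bands
    # (horizontal, vertical, diagonal), each at half the input resolution.
    approx, (horiz, vert, diag) = pywt.dwt2(image, wavelet)
    return {"approx": approx, "horizontal": horiz,
            "vertical": vert, "diagonal": diag}

# Toy usage on a random 256x256 grayscale image: every sub-band is 128x128.
image = np.random.rand(256, 256).astype(np.float32)
for name, band in decompose_into_subbands(image).items():
    print(name, band.shape)

In the framework described above, such sub-bands would then be processed by the frequency encoder's intra-band and inter-band self-attention before interacting with the RGB features.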



Updated: 2024-10-08