Fine-Grained Multimodal DeepFake Classification via Heterogeneous Graphs
International Journal of Computer Vision (IF 11.6) Pub Date: 2024-06-06, DOI: 10.1007/s11263-024-02128-1
Qilin Yin, Wei Lu, Xiaochun Cao, Xiangyang Luo, Yicong Zhou, Jiwu Huang

Nowadays, the abuse of deepfakes is a well-known issue, since deepfakes can lead to severe security and privacy problems. The situation is getting worse, as attackers are no longer limited to unimodal deepfakes but use multimodal deepfakes, i.e., both audio and video forgery, to better achieve malicious purposes. Existing unimodal or ensemble deepfake detectors lack the fine-grained classification capabilities demanded by the growing range of multimodal deepfake techniques. To address this gap, we propose a graph attention network based on a heterogeneous graph for fine-grained multimodal deepfake classification, i.e., not only distinguishing the authenticity of samples but also identifying the forged type, e.g., video, audio, or both. To this end, we propose a positional coding-based heterogeneous graph construction method that converts an audio-visual sample into a multimodal heterogeneous graph according to relevant hyperparameters. Moreover, a cross-modal graph interaction module is devised to exploit audio-visual synchronization patterns for capturing inter-modal complementary information. A de-homogenization graph pooling operation is carefully designed to preserve differences among graph node features and thereby enhance the representation of graph-level features. Through the heterogeneous graph attention network, we can efficiently model intra- and inter-modal relationships of multimodal data at both spatial and temporal scales. Extensive experimental results on two audio-visual datasets, FakeAVCeleb and LAV-DF, demonstrate that our proposed model obtains significant performance gains compared to other state-of-the-art competitors. The code is available at https://github.com/yinql1995/Fine-grained-Multimodal-DeepFake-Classification/.
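To make the graph construction step more concrete, the sketch below illustrates one plausible way an audio-visual sample could be turned into a multimodal heterogeneous graph with positional coding, intra-modal temporal edges, and cross-modal synchronization edges. This is not the authors' implementation (which is available at the GitHub link above); it is a minimal sketch assuming PyTorch Geometric's HeteroData container, and the function names, feature dimensions, and window hyperparameter are hypothetical placeholders.

```python
# Illustrative sketch only: hypothetical heterogeneous-graph construction for an
# audio-visual sample, loosely following the abstract's description. Not the
# authors' released code.
import torch
from torch_geometric.data import HeteroData


def sinusoidal_positional_encoding(num_nodes: int, dim: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding so each node keeps its temporal index."""
    position = torch.arange(num_nodes, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    pe = torch.zeros(num_nodes, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


def build_av_heterogeneous_graph(
    video_feats: torch.Tensor,   # [T_v, D] per-frame visual features (placeholder)
    audio_feats: torch.Tensor,   # [T_a, D] per-segment audio features (placeholder)
    temporal_window: int = 2,    # hyperparameter: intra-modal neighbourhood size
) -> HeteroData:
    data = HeteroData()

    # Positional coding is added so the graph retains temporal ordering.
    data["video"].x = video_feats + sinusoidal_positional_encoding(
        video_feats.size(0), video_feats.size(1))
    data["audio"].x = audio_feats + sinusoidal_positional_encoding(
        audio_feats.size(0), audio_feats.size(1))

    def temporal_edges(num_nodes: int) -> torch.Tensor:
        # Connect each node to its neighbours within the temporal window.
        src, dst = [], []
        for i in range(num_nodes):
            lo, hi = max(0, i - temporal_window), min(num_nodes, i + temporal_window + 1)
            for j in range(lo, hi):
                if i != j:
                    src.append(i)
                    dst.append(j)
        return torch.tensor([src, dst], dtype=torch.long)

    data["video", "temporal", "video"].edge_index = temporal_edges(video_feats.size(0))
    data["audio", "temporal", "audio"].edge_index = temporal_edges(audio_feats.size(0))

    # Cross-modal edges link temporally aligned video frames and audio segments,
    # reflecting the audio-visual synchronization pattern used for inter-modal interaction.
    ratio = audio_feats.size(0) / video_feats.size(0)
    v_idx = torch.arange(video_feats.size(0))
    a_idx = (v_idx.float() * ratio).long().clamp(max=audio_feats.size(0) - 1)
    data["video", "sync", "audio"].edge_index = torch.stack([v_idx, a_idx])
    data["audio", "sync", "video"].edge_index = torch.stack([a_idx, v_idx])
    return data


if __name__ == "__main__":
    # Example: 16 video frames and 64 audio segments with 128-d features each.
    graph = build_av_heterogeneous_graph(torch.randn(16, 128), torch.randn(64, 128))
    print(graph)
```

The resulting graph has typed node sets ("video", "audio") and typed edge sets, so a heterogeneous graph attention network can apply relation-specific attention to model intra- and inter-modal relationships as described in the abstract.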


Updated: 2024-06-07