International Journal of Computer Vision (IF 11.6) | Pub Date: 2024-06-06 | DOI: 10.1007/s11263-024-02128-1 | Qilin Yin, Wei Lu, Xiaochun Cao, Xiangyang Luo, Yicong Zhou, Jiwu Huang
Nowadays, the abuse of deepfakes is a well-known issue, since deepfakes can lead to severe security and privacy problems. The situation is worsening, as attackers are no longer limited to unimodal deepfakes but employ multimodal deepfakes, i.e., both audio forgery and video forgery, to better achieve malicious purposes. Existing unimodal or ensemble deepfake detectors lack the fine-grained classification capability demanded by the growing class of multimodal deepfakes. To address this gap, we propose a graph attention network based on a heterogeneous graph for fine-grained multimodal deepfake classification, i.e., not only distinguishing the authenticity of samples but also identifying the forged modality, e.g., video, audio, or both. To this end, we propose a positional-coding-based heterogeneous graph construction method that converts an audio-visual sample into a multimodal heterogeneous graph according to relevant hyperparameters. Moreover, a cross-modal graph interaction module is devised to exploit audio-visual synchronization patterns and capture inter-modal complementary information. A de-homogenization graph pooling operation is elaborately designed to preserve differences among graph node features, enhancing the representation of graph-level features. Through the heterogeneous graph attention network, we can efficiently model intra- and inter-modal relationships of multimodal data at both spatial and temporal scales. Extensive experimental results on two audio-visual datasets, FakeAVCeleb and LAV-DF, demonstrate that our proposed model obtains significant performance gains over other state-of-the-art competitors. The code is available at https://github.com/yinql1995/Fine-grained-Multimodal-DeepFake-Classification/.
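The abstract describes converting an audio-visual sample into a multimodal heterogeneous graph with positional coding, intra-modal edges, and cross-modal edges governed by hyperparameters. The authors' exact construction is defined in their released code; the following is only a minimal illustrative sketch under assumed conventions — the sinusoidal positional code, the `window` hyperparameter, and the edge rules here are hypothetical stand-ins, not the paper's method.

```python
import math

def positional_coding(t, dim):
    """Sinusoidal positional code for time step t (an illustrative choice,
    not necessarily the coding used in the paper)."""
    return [math.sin(t / 10000 ** (2 * (k // 2) / dim)) if k % 2 == 0
            else math.cos(t / 10000 ** (2 * (k // 2) / dim))
            for k in range(dim)]

def build_heterogeneous_graph(video_feats, audio_feats, window=1):
    """Sketch: convert an audio-visual sample into a heterogeneous graph.

    video_feats / audio_feats: lists of per-segment feature vectors.
    window: hypothetical hyperparameter controlling how far apart in time
    two segments may be while still receiving a cross-modal edge.
    Returns (nodes, edges): nodes keyed by (modality, index); edges are
    pairs of node keys.
    """
    dim = len(video_feats[0])
    nodes = {}
    # Each node gets its modality's feature plus a positional code,
    # so temporal order survives inside the graph.
    for mod, feats in (("video", video_feats), ("audio", audio_feats)):
        for t, f in enumerate(feats):
            pc = positional_coding(t, dim)
            nodes[(mod, t)] = [x + p for x, p in zip(f, pc)]
    edges = []
    # Intra-modal temporal edges between consecutive segments.
    for mod, feats in (("video", video_feats), ("audio", audio_feats)):
        edges += [((mod, t), (mod, t + 1)) for t in range(len(feats) - 1)]
    # Cross-modal edges linking temporally close audio/video segments,
    # a crude proxy for audio-visual synchronization patterns.
    for i in range(len(video_feats)):
        for j in range(len(audio_feats)):
            if abs(i - j) <= window:
                edges.append((("video", i), ("audio", j)))
    return nodes, edges
```

A graph attention network would then propagate messages over both edge types, which is what lets the model relate intra- and inter-modal evidence at spatial and temporal scales.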
Title: Fine-grained Multimodal DeepFake Classification via Heterogeneous Graphs