International Journal of Computer Vision (IF 11.6), Pub Date: 2024-12-10, DOI: 10.1007/s11263-024-02255-9
Xiao Guo, Xiaohong Liu, Iacopo Masi, Xiaoming Liu
Forgery attributes differ widely between images produced in the CNN-synthesized domain and those produced in the image-editing domain, and these differences make unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. We first represent the forgery attributes of a manipulated image with multiple labels at different levels. Then, we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and the inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. In this work, we propose a Language-guided Hierarchical Fine-grained IFDL method, denoted HiFi-Net++. HiFi-Net++ contains four components: a multi-branch feature extractor, a language-guided forgery localization enhancer (LFLE), a classification module, and a localization module. Each branch of the multi-branch feature extractor learns to classify forgery attributes at one level, while the localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. The LFLE, which contains image and text encoders learned by contrastive language-image pre-training (CLIP), further enriches the IFDL representation: it takes specifically designed texts and the given image as multi-modal inputs and generates the visual embedding and manipulation score maps, which are used to improve the manipulation localization performance of HiFi-Net++. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 8 different benchmarks for both IFDL and forgery attribute classification. Our source code and dataset can be found at github.com/CHELSEA234/HiFi-IFDL.
Title: Language-guided Hierarchical Fine-grained Image Forgery Detection and Localization
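The hierarchical formulation described in the abstract (multiple labels at different levels, with fine-grained classification that uses the dependency between levels) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the two-level taxonomy, the label names, and the function names `hierarchical_probs` and `label_path` are hypothetical and are not taken from the paper or its repository. The dependency is expressed with a plain chain rule, P(fine) = P(coarse) * P(fine | coarse), which is one simple way to couple the levels and is not necessarily the mechanism HiFi-Net++ uses.

```python
import math

# Hypothetical two-level taxonomy (illustrative names, NOT from the paper).
# Each fine-grained label maps to its coarse-level parent.
PARENT = {
    "diffusion": "fully_synthesized",
    "gan": "fully_synthesized",
    "splicing": "partially_manipulated",
    "inpainting": "partially_manipulated",
}
COARSE = ["fully_synthesized", "partially_manipulated"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_path(fine_label):
    """Return the multi-level label of an image, from coarse to fine."""
    return [PARENT[fine_label], fine_label]


def hierarchical_probs(coarse_logits, fine_logits):
    """Combine per-level logits so the fine level depends on the coarse level.

    coarse_logits: dict mapping coarse label -> logit
    fine_logits:   dict mapping fine label -> logit
    Returns (p_coarse, p_fine), both dicts of probabilities.
    """
    p_coarse = dict(zip(COARSE, softmax([coarse_logits[c] for c in COARSE])))
    p_fine = {}
    for parent in COARSE:
        children = [c for c in PARENT if PARENT[c] == parent]
        # P(child | parent): softmax restricted to the children of this parent.
        cond = softmax([fine_logits[c] for c in children])
        for child, p in zip(children, cond):
            # Chain rule: P(child) = P(parent) * P(child | parent).
            p_fine[child] = p_coarse[parent] * p
    return p_coarse, p_fine


if __name__ == "__main__":
    coarse = {"fully_synthesized": 2.0, "partially_manipulated": 0.0}
    fine = {"diffusion": 1.0, "gan": 0.0, "splicing": 0.5, "inpainting": 0.5}
    p_coarse, p_fine = hierarchical_probs(coarse, fine)
    print(label_path("gan"))
    print(p_coarse)
    print(p_fine)
```

With this factorization, the fine-level probabilities under each parent sum to that parent's coarse-level probability, so a confident coarse prediction suppresses fine-grained labels that belong to the other branch of the hierarchy.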