SingleADV: Single-Class Target-Specific Attack Against Interpretable Deep Learning Systems
IEEE Transactions on Information Forensics and Security (IF 6.3) Pub Date: 2024-05-30, DOI: 10.1109/tifs.2024.3407652
Eldor Abdukhamidov, Mohammed Abuhamad, George K. Thiruvathukal, Hyoungshick Kim, Tamer Abuhmed

By establishing trust and helping experts debug and understand the inner workings of deep learning models, interpretation methods are increasingly coupled with these models to build interpretable deep learning systems. However, adversarial attacks pose a significant threat to public trust by rendering the interpretations of deep learning models confusing and hard to understand. In this paper, we present a novel Single-class target-specific ADVersarial attack called SingleADV. The goal of SingleADV is to generate a universal perturbation that deceives the target model into confusing a specific category of objects with a target category while preserving highly relevant and accurate interpretations. The universal perturbation is stochastically and iteratively optimized by minimizing an adversarial loss designed to account for both classifier and interpreter costs over targeted and non-targeted categories. In this optimization framework, guided by first- and second-moment estimates of the gradient, the resulting loss surface promotes high confidence and interpretation scores for adversarial samples. By avoiding unintended misclassification of samples from other categories, SingleADV enables more effective targeted attacks on interpretable deep learning systems in both white-box and black-box scenarios. To evaluate the effectiveness of SingleADV, we conduct experiments using four model architectures (ResNet-50, VGG-16, DenseNet-169, and Inception-V3) coupled with three interpretation models (CAM, Grad, and MASK). Through extensive empirical evaluation, we demonstrate that SingleADV effectively deceives target deep learning models and their associated interpreters under various conditions and settings. Our results show that SingleADV is effective, achieving an average attack success rate of 74% and prediction confidence exceeding 77% on successful adversarial samples. Furthermore, we discuss several countermeasures against SingleADV, including a transfer-based learning approach and existing preprocessing defenses.
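To make the described optimization concrete, below is a minimal sketch of this kind of loop in PyTorch (the paper does not prescribe a framework, so this is an assumption). Adam supplies the first- and second-moment gradient estimates the abstract mentions, and the loss combines a classifier term (steering source-category samples toward the target category) with an interpreter term (keeping adversarial interpretation maps close to benign ones). The names classifier, interpreter, and source_loader, the input size, and the weights epsilon and lam are hypothetical stand-ins; the paper's full loss also penalizes unintended misclassification of non-targeted categories, which is omitted here.

import itertools
import torch

def optimize_universal_perturbation(classifier, interpreter, source_loader,
                                    target_class, epsilon=8 / 255,
                                    steps=1000, lr=0.01, lam=1.0):
    # One universal perturbation shared across all source-category samples
    # (224x224 RGB inputs assumed for illustration).
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)
    # Adam provides the first- and second-moment gradient estimates.
    opt = torch.optim.Adam([delta], lr=lr)
    ce = torch.nn.CrossEntropyLoss()

    for _, (x, _) in zip(range(steps), itertools.cycle(source_loader)):
        x_adv = torch.clamp(x + delta, 0.0, 1.0)

        # Classifier cost: push predictions toward the target category.
        logits = classifier(x_adv)
        tgt = torch.full((x.size(0),), target_class, dtype=torch.long)
        cls_loss = ce(logits, tgt)

        # Interpreter cost: keep the adversarial interpretation map close
        # to the benign one so the interpretation stays relevant and accurate.
        with torch.no_grad():
            benign_map = interpreter(x)
        adv_map = interpreter(x_adv)
        int_loss = torch.norm(adv_map - benign_map, p=2)

        loss = cls_loss + lam * int_loss
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Project the perturbation back into the L-infinity budget.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return delta.detach()

In a white-box setting the attacker would differentiate through the victim model directly as above; the black-box variant discussed in the abstract would instead rely on a surrogate model or gradient estimation in place of these direct backward passes.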

Last updated: 2024-05-30