FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition
arXiv - CS - Machine Learning. Pub Date: 2024-02-05, DOI: arxiv-2402.03241
Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han

In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from the strong generalization capability gained from pretraining on massive image-text pairs. However, applying CLIP directly to open-vocabulary action recognition is challenging because CLIP's pretraining contains no temporal information. Furthermore, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, yielding unsatisfactory results on unseen actions. To address these issues, FROSTER employs a residual feature distillation approach that ensures CLIP retains its generalization capability while adapting effectively to the action recognition task. Specifically, residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP, and it supervises the feature learning used to extract video-specific features, bridging the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to balance the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings, where it consistently achieves state-of-the-art performance across all datasets. Project page: https://visual-ai.github.io/froster.
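To make the residual feature distillation idea above concrete, here is a minimal PyTorch-style sketch: a student (tuned video) feature is passed through an identity-plus-projection residual head and matched to the frozen CLIP teacher feature. The class name ResidualDistillationHead, the two-layer projection, the hidden width, and the cosine-distance objective are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualDistillationHead(nn.Module):
    """Hypothetical residual sub-network: maps tuned (video) features back
    toward the frozen-CLIP feature space via an identity + projection path."""

    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # Identity branch keeps the video-specific features intact;
        # the projection branch supplies the residual used for distillation.
        return student_feat + self.proj(student_feat)


def residual_distillation_loss(student_feat, frozen_clip_feat, head):
    # Match the residual-projected student features to the frozen CLIP
    # teacher features (cosine distance chosen here for illustration).
    distilled = head(student_feat)
    return 1.0 - F.cosine_similarity(distilled, frozen_clip_feat, dim=-1).mean()


if __name__ == "__main__":
    batch, dim = 8, 512
    head = ResidualDistillationHead(dim)
    student = torch.randn(batch, dim)      # features from the tuned video model
    with torch.no_grad():
        teacher = torch.randn(batch, dim)  # stand-in for frozen CLIP features
    loss = residual_distillation_loss(student, teacher, head)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In practice this loss would be added alongside the usual classification objective, so the residual head absorbs the distillation pressure while the backbone remains free to learn video-specific cues.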

Updated: 2024-02-06