Hybrid attentive prototypical network for few-shot action recognition
Complex & Intelligent Systems (IF 5.0) | Pub Date: 2024-08-19 | DOI: 10.1007/s40747-024-01571-4
Zanxi Ruan, Yingmei Wei, Yanming Guo, Yuxiang Xie
Most previous few-shot action recognition works tend to process temporal and spatial video features separately, resulting in insufficient extraction of comprehensive features. In this paper, a novel hybrid attentive prototypical network (HAPN) framework for few-shot action recognition is proposed. Distinguished by its joint processing of temporal and spatial information, the HAPN framework manipulates these dimensions from feature extraction through the attention module, thereby enhancing its ability to perform action recognition tasks. Our framework uses an R(2+1)D backbone network, coupling the extraction of temporal and spatial features to ensure a comprehensive understanding of video content. Additionally, it introduces a novel Residual Tri-dimensional Attention (ResTriDA) mechanism, specifically designed to augment feature information across the temporal, spatial, and channel dimensions. ResTriDA dynamically enhances the crucial aspects of video features: it amplifies the channel-wise features that matter for distinguishing actions, accentuates the spatial details that capture the essence of an action within each frame, and emphasizes the temporal dynamics that capture movement over time. We further propose a prototypical attentive matching module (PAM), built on metric learning, to mitigate the overfitting common in few-shot tasks. We evaluate HAPN on three classical few-shot action recognition datasets: Kinetics-100, UCF101, and HMDB51. The results indicate that our framework significantly outperforms state-of-the-art methods. Notably, on the 1-shot task, accuracy improves by 9.8% on UCF101, 3.9% on HMDB51, and 12.4% on Kinetics-100. These gains confirm the robustness and effectiveness of our approach in leveraging limited data for precise action recognition.
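As a rough illustration of the tri-dimensional attention idea described in the abstract, the sketch below gates a video feature map of shape (batch, channel, time, height, width) along the channel, spatial, and temporal axes and adds a residual connection. The abstract does not specify the layer design, so the pooling, convolution, and sigmoid-gating choices here are assumptions for illustration only, not the authors' ResTriDA implementation.

```python
# Minimal PyTorch sketch of a residual tri-dimensional attention block in the
# spirit of ResTriDA. Layer choices (pooling + bottleneck MLP, small convs,
# sigmoid gates) are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class ResTriDimAttention(nn.Module):
    """Gates a clip feature map (B, C, T, H, W) along channel, spatial, and
    temporal dimensions, then adds the input back (residual)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel attention: squeeze (T, H, W), then a bottleneck MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: 2D conv over per-frame channel-pooled maps.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Temporal attention: 1D conv along the frame axis.
        self.temporal_conv = nn.Conv1d(2, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape

        # Channel gate: global average over (T, H, W), bottleneck MLP, sigmoid.
        ch = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3, 4))))   # (B, C)
        out = x * ch.view(b, c, 1, 1, 1)

        # Spatial gate: mean/max over channels per frame, shared 2D conv.
        sp = torch.stack([out.mean(dim=1), out.amax(dim=1)], dim=2)   # (B, T, 2, H, W)
        sp = torch.sigmoid(self.spatial_conv(sp.reshape(b * t, 2, h, w)))
        out = out * sp.reshape(b, t, 1, h, w).permute(0, 2, 1, 3, 4)  # (B, 1, T, H, W)

        # Temporal gate: mean/max over (C, H, W), 1D conv along frames.
        tp = torch.stack([out.mean(dim=(1, 3, 4)), out.amax(dim=(1, 3, 4))], dim=1)
        tp = torch.sigmoid(self.temporal_conv(tp))                    # (B, 1, T)
        out = out * tp.view(b, 1, t, 1, 1)

        # Residual connection keeps the original features accessible.
        return x + out


if __name__ == "__main__":
    feats = torch.randn(2, 64, 8, 14, 14)        # (B, C, T, H, W) clip features
    print(ResTriDimAttention(64)(feats).shape)   # torch.Size([2, 64, 8, 14, 14])
```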
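The metric-learning matching that PAM builds on can likewise be sketched as standard prototype classification: support-clip embeddings are averaged into per-class prototypes and each query is scored by its negative squared Euclidean distance to them. The attention component of PAM and any HAPN-specific details are omitted; the function and episode setup below are illustrative assumptions, not the authors' module.

```python
# Minimal sketch of prototype-based matching for an episodic few-shot task.
# Generic metric-learning baseline shown for illustration only.
import torch
import torch.nn.functional as F


def prototypical_logits(support: torch.Tensor,
                        support_labels: torch.Tensor,
                        query: torch.Tensor,
                        n_way: int) -> torch.Tensor:
    """support: (N_s, D) embeddings, support_labels: (N_s,) in [0, n_way),
    query: (N_q, D). Returns (N_q, n_way) similarity logits."""
    # Class prototype = mean embedding of that class's support clips.
    prototypes = torch.stack(
        [support[support_labels == k].mean(dim=0) for k in range(n_way)]
    )                                                   # (n_way, D)
    # Negative squared Euclidean distance as the similarity score.
    return -torch.cdist(query, prototypes).pow(2)       # (N_q, n_way)


# Example episode: a 5-way 1-shot task with 256-d clip embeddings.
if __name__ == "__main__":
    emb_dim, n_way, n_query = 256, 5, 10
    support = torch.randn(n_way, emb_dim)               # one clip per class (1-shot)
    support_labels = torch.arange(n_way)
    query = torch.randn(n_query, emb_dim)
    query_labels = torch.randint(0, n_way, (n_query,))
    logits = prototypical_logits(support, support_labels, query, n_way)
    print(F.cross_entropy(logits, query_labels).item())
```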