STMixer: A One-Stage Sparse Action Detector
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2024-04-10, DOI: 10.1109/tpami.2024.3387127
Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, Limin Wang

Traditional video action detectors typically adopt a two-stage pipeline, where a person detector is first employed to generate actor boxes and then 3D RoIAlign is used to extract actor-specific features for action recognition. This detection paradigm requires multi-stage training and inference, and feature sampling is confined to the inside of the box, so it fails to leverage the richer context information outside. Recently, several query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and therefore suffer from inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which gives the detector the flexibility to mine a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions separately for better feature decoding. Based on these designs, we instantiate two detection pipelines: STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection or action tube detection.
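
As a rough illustration of the sampling idea in the abstract, the sketch below reads feature values at a set of points placed around a query's reference location in every frame, so that points may fall anywhere in the feature map rather than only inside an actor box. It is not the authors' implementation: the function names, the single-channel list-of-lists feature maps, the fixed `offsets` (which a real detector would presumably predict from the query with a learned layer), and the border clamping are all assumptions made for this example.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at a continuous (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    # Assumption: points that land outside the map are clamped to the border.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def sample_for_query(frames, center, offsets):
    """Sample one value per (frame, point) for a single query.

    frames:  list of T feature maps, each a list of rows of floats
    center:  (x, y) reference point of the query (hypothetical parameter)
    offsets: list of P (dx, dy) pairs; hard-coded here, whereas a learned
             layer would predict them from the query in a real detector
    Returns a T x P nested list of sampled values.
    """
    cx, cy = center
    return [
        [bilinear_sample(frame, cx + dx, cy + dy) for dx, dy in offsets]
        for frame in frames
    ]


if __name__ == "__main__":
    frames = [
        [[0.0, 1.0], [2.0, 3.0]],
        [[10.0, 11.0], [12.0, 13.0]],
    ]
    # Two sampling points per frame, shared across both frames.
    print(sample_for_query(frames, (0.5, 0.5), [(0.0, 0.0), (0.5, -0.5)]))
```

The resulting T x P grid of sampled features is the kind of input that a decoupled mixing step could then process along the point (spatial) axis and the frame (temporal) axis separately; that step is not sketched here.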

Updated: 2024-04-10