Weakly Supervised Procedure-Aware Instructional Video Correlation Learning from a Collaborative Perspective
International Journal of Computer Vision (IF 11.6), Pub Date: 2024-11-04, DOI: 10.1007/s11263-024-02272-8
Tianyao He, Huabin Liu, Zelin Ni, Yuxi Li, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Weiyao Lin
Video Correlation Learning (VCL) is a high-level research area centered on analyzing the semantic and temporal correspondences between videos through a comparative paradigm. Recently, tasks involving instructional videos have drawn increasing attention due to their promising potential. Compared with general videos, instructional videos carry more complex procedural information, which makes correlation learning particularly challenging. To obtain procedural knowledge, current methods rely heavily on fine-grained step-level annotations, which are costly and do not scale. To improve VCL on instructional videos, we introduce a weakly supervised framework named Collaborative Procedure Alignment (CPA). Specifically, the framework comprises two core components: a collaborative step mining (CSM) module and a frame-to-step alignment (FSA) module. Without requiring step-level annotations, the CSM module performs temporal step segmentation and pseudo-step learning by exploiting the inner procedure correspondences between paired videos. The FSA module then efficiently estimates the probability of aligning one video's frame-level features with the other video's pseudo-step labels, which serves as a reliable correlation measure for the video pair. The two modules are inherently interconnected and mutually reinforcing, enabling the framework to extract step-level knowledge and measure video correlation distances accurately. Our framework provides an effective tool for instructional video correlation learning. We instantiate it on four representative tasks: sequence verification, few-shot action recognition, temporal action segmentation, and action quality assessment. Furthermore, we extend the framework with additional functions to further demonstrate its potential. Extensive and in-depth experiments validate CPA's strong correlation learning capability on instructional videos. The implementation can be found at https://github.com/hotelll/Collaborative_Procedure_Alignment.
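The two-stage design described in the abstract can be made concrete with a small sketch. The PyTorch code below is an illustrative approximation only, assuming pre-extracted frame features: a uniform temporal split stands in for the CSM module's learned, collaboratively mined step segmentation, and a softmax alignment confidence stands in for the FSA module's frame-to-step alignment probability. All function and parameter names (mine_pseudo_steps, frame_to_step_alignment, num_steps) are hypothetical; the authors' actual implementation is at the GitHub link above.

```python
# Minimal sketch of the CPA idea: (1) partition two paired videos into the same
# number of pseudo-steps, (2) score how well one video's frames align with the
# other video's pseudo-step prototypes, and use that as a correlation degree.
# Not the released implementation; a conceptual approximation only.
import torch
import torch.nn.functional as F


def mine_pseudo_steps(feats: torch.Tensor, num_steps: int) -> torch.Tensor:
    """Split T frame features (T, D) into `num_steps` contiguous temporal
    segments and average-pool each into a pseudo-step prototype (num_steps, D).
    A uniform split stands in for the CSM module's learned segmentation."""
    chunks = torch.chunk(feats, num_steps, dim=0)
    return torch.stack([c.mean(dim=0) for c in chunks])


def frame_to_step_alignment(frames: torch.Tensor, steps: torch.Tensor) -> torch.Tensor:
    """Probability of assigning each frame of one video (T, D) to each
    pseudo-step prototype of the other video (K, D), via a softmax over
    cosine similarities. Returns a (T, K) assignment matrix."""
    sims = F.normalize(frames, dim=-1) @ F.normalize(steps, dim=-1).T
    return sims.softmax(dim=-1)


def correlation_score(video_a: torch.Tensor, video_b: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
    """Symmetric correlation degree for a video pair: average confidence of
    aligning A's frames to B's pseudo-steps, and vice versa."""
    steps_a = mine_pseudo_steps(video_a, num_steps)
    steps_b = mine_pseudo_steps(video_b, num_steps)
    p_ab = frame_to_step_alignment(video_a, steps_b).max(dim=-1).values.mean()
    p_ba = frame_to_step_alignment(video_b, steps_a).max(dim=-1).values.mean()
    return 0.5 * (p_ab + p_ba)


if __name__ == "__main__":
    a = torch.randn(32, 256)  # 32 frame features of video A
    b = torch.randn(40, 256)  # 40 frame features of video B
    print(float(correlation_score(a, b)))
```

In the actual framework, the step boundaries are learned collaboratively from the inner procedure correspondences of the paired videos rather than fixed uniformly, and the alignment probability is used as the correlation distance for the downstream tasks listed in the abstract.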