Learning Text-to-Video Retrieval from Image Captions
International Journal of Computer Vision (IF 11.6) Pub Date: 2024-10-22, DOI: 10.1007/s11263-024-02202-8
Lucas Ventura, Cordelia Schmid, Gül Varol
We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to images labeled with text. Using image expert models is a realistic scenario given that annotating images is cheaper, and therefore more scalable, than expensive video labeling schemes. Recently, zero-shot image experts such as CLIP have established a new strong baseline for video understanding tasks. In this paper, we make use of this progress and instantiate the image experts from two types of models: a text-to-image retrieval model to provide an initial backbone, and image captioning models to provide a supervision signal for unlabeled videos. We show that automatically labeling video frames with image captioning enables text-to-video retrieval training. This process adapts the features to the target domain at no manual annotation cost, consequently outperforming the strong zero-shot CLIP baseline. During training, we sample captions from multiple video frames that best match the visual content, and perform temporal pooling over frame representations by scoring frames according to their relevance to each caption. We conduct extensive ablations to provide insights, and demonstrate the effectiveness of this simple framework by outperforming the CLIP zero-shot baselines on text-to-video retrieval on three standard datasets, namely ActivityNet, MSR-VTT, and MSVD. Code and models will be made publicly available.
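To make the training signal described above concrete, the following is a minimal PyTorch sketch of the two steps the abstract mentions: selecting a generated caption that best matches the video's frames, and temporally pooling frame features weighted by their relevance to that caption. The function names, the mean-over-frames caption score, and the softmax temperature are illustrative assumptions, not the authors' implementation; random tensors stand in for CLIP image/text embeddings.

```python
import torch
import torch.nn.functional as F

def select_caption(frame_feats, caption_feats):
    """Pick the generated caption whose embedding best matches the video frames.

    frame_feats:   (T, D) L2-normalized frame embeddings (e.g. from an image encoder)
    caption_feats: (C, D) L2-normalized embeddings of per-frame generated captions
    Returns the index of the highest-scoring caption.
    """
    sims = caption_feats @ frame_feats.T        # (C, T) cosine similarities
    return sims.mean(dim=1).argmax().item()     # score each caption by its mean over frames

def pool_frames(frame_feats, caption_feat, temperature=0.07):
    """Temporal pooling: weight each frame by its relevance to the caption."""
    scores = frame_feats @ caption_feat                 # (T,) frame-to-caption similarity
    weights = F.softmax(scores / temperature, dim=0)    # (T,) relevance weights
    video_feat = (weights.unsqueeze(1) * frame_feats).sum(dim=0)  # (D,)
    return F.normalize(video_feat, dim=0)

# Toy usage with random features standing in for CLIP embeddings.
T, C, D = 8, 8, 512
frames = F.normalize(torch.randn(T, D), dim=1)
captions = F.normalize(torch.randn(C, D), dim=1)
idx = select_caption(frames, captions)
video = pool_frames(frames, captions[idx])
print(idx, video.shape)  # e.g. 3 torch.Size([512])
```

In this sketch the selected caption serves as the training target for the pooled video embedding; the actual paper should be consulted for the exact sampling and scoring choices.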