

TeachText: CrossModal text-video retrieval through generalized distillation
Artificial Intelligence (IF 5.1), Pub Date: 2024-10-30, DOI: 10.1016/j.artint.2024.104235
Ioana Croitoru, Simion-Vlad Bogolin, Marius Leordeanu, Hailin Jin, Andrew Zisserman, Yang Liu, Samuel Albanie

In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we investigate the design of such algorithms and propose a novel generalized distillation method, TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. TeachText yields significant gains on a number of video retrieval benchmarks without incurring additional computational overhead during inference and was used to produce the winning entry in the Condensed Movie Challenge at ICCV 2021. We show how TeachText can be extended to include multiple video modalities, reducing computational cost at inference without compromising performance. Finally, we demonstrate the application of our method to the task of removing noisy descriptions from the training partitions of retrieval datasets to improve performance. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
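The core idea of the abstract, a retrieval model supervised by several text encoders at once, can be illustrated with a small sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the teachers are represented by text-video similarity matrices, they are combined with a plain mean, the distillation term is a mean squared error, and the retrieval term is a bidirectional max-margin ranking loss. All function names, the `margin` value, and the `weight` parameter are hypothetical and not taken from the TeachText code.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, video_emb):
    """Return an (n_text, n_video) matrix of cosine similarities."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    return t @ v.T


def aggregate_teachers(teacher_sims):
    """Combine teacher similarity matrices (assumption: a plain mean)."""
    return np.mean(np.stack(teacher_sims, axis=0), axis=0)


def distillation_loss(student_sim, teacher_sim):
    """Assumed distillation term: mean squared difference of similarities."""
    return float(np.mean((student_sim - teacher_sim) ** 2))


def ranking_loss(sim, margin=0.2):
    """Bidirectional max-margin loss; matching pairs lie on the diagonal."""
    n = sim.shape[0]
    pos = np.diag(sim)
    off_diag = ~np.eye(n, dtype=bool)
    # Each caption should score its own video above every other video ...
    text_to_video = np.maximum(0.0, margin + sim - pos[:, None])
    # ... and each video should score its own caption above other captions.
    video_to_text = np.maximum(0.0, margin + sim - pos[None, :])
    return float(
        (text_to_video[off_diag].mean() + video_to_text[off_diag].mean()) / 2
    )


def total_loss(student_sim, teacher_sims, weight=1.0):
    """Retrieval loss plus the weighted distillation term (training only)."""
    teacher_sim = aggregate_teachers(teacher_sims)
    return ranking_loss(student_sim) + weight * distillation_loss(
        student_sim, teacher_sim
    )
```

In this sketch the teacher matrices appear only in `total_loss`, which is a training-time quantity. At query time the student computes a single similarity matrix on its own, which is consistent with the abstract's claim that no additional computational overhead is incurred during inference.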


Updated: 2024-10-30
