Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-11-13, DOI: 10.1007/s40747-024-01614-w
Xue Bo, Junjie Liu, Di Yang, Wentao Ma
Text-based cross-modal vehicle retrieval has been widely applied in smart city contexts and other scenarios. The objective of this approach is to identify semantically relevant target vehicles in videos using text descriptions, thereby facilitating the analysis of vehicle spatio-temporal trajectories. Current methodologies predominantly employ a two-tower architecture, where single-granularity features from both visual and textual domains are extracted independently. However, due to the intricate semantic relationships between videos and text, aligning the two modalities effectively using single-granularity feature representation poses a challenge. To address this issue, we introduce a Multi-Granularity Representation Learning model, termed MGRL, tailored for text-based cross-modal vehicle retrieval. Specifically, the model parses information from the two modalities into three hierarchical levels of feature representation: coarse-granularity, medium-granularity, and fine-granularity. Subsequently, a feature adaptive fusion strategy is devised to automatically determine the optimal pooling mechanism. Finally, a multi-granularity contrastive learning approach is implemented to ensure comprehensive semantic coverage, ranging from coarse to fine levels. Experimental outcomes on public benchmarks show that our method achieves up to a 14.56% improvement in text-to-vehicle retrieval performance, as measured by the Mean Reciprocal Rank (MRR) metric, when compared against 10 state-of-the-art baselines and 6 ablation studies.
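The abstract reports its gains in Mean Reciprocal Rank (MRR). As a reading aid, here is a minimal sketch of how MRR is conventionally computed for text-to-vehicle retrieval: each text query yields a ranked list of candidate vehicle tracks, and the score is the average of 1/rank of the correct track. The function name, the track identifiers, and the choice to score a missing target as 0 are illustrative assumptions, not details taken from the paper or its evaluation code.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the correct item over all queries.

    rankings[i] is the candidate list for query i, best match first.
    targets[i] is the ground-truth item for query i.
    Assumption: a query whose target is absent from its ranking contributes 0.
    """
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")

    total = 0.0
    for ranked, target in zip(rankings, targets):
        # Ranks are 1-based: the top candidate has rank 1.
        for position, candidate in enumerate(ranked, start=1):
            if candidate == target:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical example: three text queries, each ranking three vehicle tracks.
rankings = [
    ["track_a", "track_b", "track_c"],  # target at rank 1
    ["track_b", "track_a", "track_c"],  # target at rank 2
    ["track_c", "track_b", "track_a"],  # target at rank 3
]
targets = ["track_a", "track_a", "track_a"]
print(round(mean_reciprocal_rank(rankings, targets), 3))  # prints 0.611
```

In the example the reciprocal ranks are 1, 1/2, and 1/3, so the mean is 11/18. MRR rewards placing the correct vehicle near the top of the list and is insensitive to the order of the remaining candidates.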
Title: Bridging the gap: multi-granularity representation learning for text-based vehicle retrieval