GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning, IEEE Transactions on Image Processing


IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2024-10-31, DOI: 10.1109/tip.2024.3485498
Haiwen Diao, Ying Zhang, Shang Gao, Jiawen Zhu, Long Chen, Huchuan Lu

Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently use either a simple cosine metric or complex distance metrics to transform pairwise features into a similarity score, and these suffer from inadequate or inefficient distance measurement, respectively. We therefore propose a Generalized Structural Sparse Function (GSSF) that dynamically captures thorough and powerful relationships across modalities for pairwise similarity learning while remaining concise and efficient. Specifically, the distance metric encapsulates two formats of terms, diagonal and block-diagonal, automatically distinguishing and highlighting cross-channel relevancy and dependency inside a structured and organized topology. It can thereby adapt to the optimal matching patterns between paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on one cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, and fine-grained image retrieval) validate its superiority and flexibility over various popular retrieval frameworks. More importantly, we further find that it can be seamlessly incorporated into multiple application scenarios, and it demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.
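To make the "diagonal plus block-diagonal" idea concrete, the following is a minimal illustrative sketch, not the paper's actual formulation. It assumes a bilinear score of the form x^T (D + B) y on L2-normalized features, where D is diagonal and B is block-diagonal. The function name `structured_similarity`, the fixed block size, and the random weights are all hypothetical choices made for illustration only.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def structured_similarity(x, y, diag_w, blocks):
    """Score x^T (D + B) y with D diagonal and B block-diagonal (assumed form).

    diag_w: one weight per channel (the diagonal term).
    blocks: list of square matrices (lists of rows) whose sizes sum to len(x).
    """
    x, y = l2_normalize(x), l2_normalize(y)
    # Diagonal term: each channel of x interacts only with the same channel of y.
    score = sum(w * a * b for w, a, b in zip(diag_w, x, y))
    # Block-diagonal term: channels interact only within their own group.
    start = 0
    for block in blocks:
        k = len(block)
        for i in range(k):
            for j in range(k):
                score += x[start + i] * block[i][j] * y[start + j]
        start += k
    return score


random.seed(0)
dim, block_size = 8, 4
n_blocks = dim // block_size
x = [random.gauss(0, 1) for _ in range(dim)]
y = [random.gauss(0, 1) for _ in range(dim)]

# Unit diagonal weights and all-zero blocks reduce the score to cosine similarity.
zero_blocks = [[[0.0] * block_size for _ in range(block_size)] for _ in range(n_blocks)]
cosine = sum(a * b for a, b in zip(l2_normalize(x), l2_normalize(y)))
baseline = structured_similarity(x, y, [1.0] * dim, zero_blocks)

# Non-zero blocks add cross-channel interactions inside each channel group.
blocks = [
    [[random.gauss(0, 0.1) for _ in range(block_size)] for _ in range(block_size)]
    for _ in range(n_blocks)
]
score = structured_similarity(x, y, [1.0] * dim, blocks)

# Weight count of the structured form versus a dense dim x dim bilinear form.
n_params_structured = dim + n_blocks * block_size ** 2
n_params_dense = dim ** 2
```

Under these assumptions, the structured form interpolates between plain cosine similarity (diagonal only) and a full bilinear metric, while using fewer weights than the dense alternative. This is one way to read the abstract's claim of balancing model complexity and capability.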


Updated: 2024-10-31
