Integrating synthetic datasets with CLIP semantic insights for single image localization advancements, ISPRS Journal of Photogrammetry and Remote Sensing



ISPRS Journal of Photogrammetry and Remote Sensing (IF 10.6), Pub Date: 2024-11-06, DOI: 10.1016/j.isprsjprs.2024.10.027
Dansheng Yao, Mengqi Zhu, Hehua Zhu, Wuqiang Cai, Long Zhou

Accurate localization of pedestrians and mobile robots is critical for navigation, emergency response, and autonomous driving. Traditional localization methods, such as those based on satellite signals, often prove ineffective in certain environments, and acquiring sufficient positional data can be challenging. Single image localization techniques have been developed to address these issues. However, current deep learning frameworks for single image localization that rely on domain adaptation fail to effectively utilize semantically rich high-level features obtained from large-scale pretraining. This paper introduces a novel framework that leverages the Contrastive Language-Image Pre-training (CLIP) model and prompts to enhance feature extraction and domain adaptation through semantic information. The proposed framework generates an integrated score map from scene-specific prompts to guide feature extraction and employs adversarial components to facilitate domain adaptation. Furthermore, a reslink component is incorporated to mitigate the precision loss in high-level features compared to the original data. Experimental results demonstrate that the use of prompts reduces localization errors by 26.4% in indoor environments and 24.3% in outdoor settings. The model achieves localization errors as low as 0.75 m and 8.09 degrees indoors, and 4.56 m and 7.68 degrees outdoors. Analysis of prompts from labeled datasets confirms the model’s capability to effectively interpret scene information. The weights of the integrated score map enhance the model’s transparency, thereby improving interpretability. This study underscores the efficacy of integrating semantic information into image localization tasks.
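
The abstract does not give formulas for the integrated score map, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each scene-specific prompt yields a cosine-similarity map against image patch features, and that the per-prompt maps are combined with softmax weights. Random arrays stand in for CLIP embeddings, and every name here (`integrated_score_map`, `prompt_logits`, the 7x7 grid, the 512-dimensional features) is hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def integrated_score_map(patch_feats, prompt_embs, prompt_logits):
    """Combine per-prompt similarity maps into one weighted map.

    patch_feats:   (H, W, D) patch embeddings (stand-in for CLIP visual features)
    prompt_embs:   (K, D) text embeddings, one per scene-specific prompt
    prompt_logits: (K,) hypothetical learnable scores, converted to weights
    Returns the (H, W) integrated map and the (K,) prompt weights.
    """
    f = l2_normalize(patch_feats)
    t = l2_normalize(prompt_embs)
    # Cosine similarity of every patch with every prompt: shape (K, H, W).
    per_prompt = np.einsum("hwd,kd->khw", f, t)
    weights = softmax(prompt_logits)
    # Weighted sum over the prompt axis: shape (H, W).
    integrated = np.tensordot(weights, per_prompt, axes=1)
    return integrated, weights


# Assumed toy inputs: random features instead of real CLIP outputs.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(7, 7, 512))
prompt_embs = rng.normal(size=(3, 512))
prompt_logits = np.zeros(3)  # equal logits give equal prompt weights

score_map, weights = integrated_score_map(patch_feats, prompt_embs, prompt_logits)

# One assumed way to "guide" feature extraction: rescale each patch by its score.
guided_feats = patch_feats * score_map[..., None]
```

Under these assumptions, inspecting `weights` shows how much each prompt contributes to the combined map, which is the kind of transparency the abstract attributes to the score map weights.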


Updated: 2024-11-06
