LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image–text retrieval
ISPRS Journal of Photogrammetry and Remote Sensing (IF 10.6), Pub Date: 2025-02-27, DOI: 10.1016/j.isprsjprs.2025.02.009
Yuanxin Zhao, Mi Zhang, Bingnan Yang, Zhan Zhang, Jujia Kang, Jianya Gong

Image–text retrieval (ITR) is crucial for making informed decisions in various remote sensing (RS) applications, including urban development and disaster prevention. However, creating ITR datasets that combine vision and language modalities requires extensive geo-spatial sampling, diverse categories, and detailed descriptions. To address these needs, we introduce the LuojiaHOG dataset, which is geospatially aware, label-extension-friendly, and comprehensively captioned. LuojiaHOG incorporates hierarchical spatial sampling, an extensible classification system aligned with Open Geospatial Consortium (OGC) standards, and detailed caption generation. Additionally, we propose a CLIP-based Image Semantic Enhancement Network (CISEN) to support more sophisticated ITR. CISEN comprises dual-path knowledge transfer and progressive cross-modal feature fusion. The former transfers multimodal knowledge from a large, pretrained CLIP-like model, while the latter enhances visual-to-text alignment and fine-grained cross-modal feature integration. Comprehensive statistics on LuojiaHOG demonstrate its richness in sampling diversity, label quantity, and description granularity. Evaluations on LuojiaHOG with various state-of-the-art ITR models (ALBEF, ALIGN, CLIP, FILIP, Wukong, GeoRSCLIP, and CISEN) use second- and third-level labels. Under adapter-tuning, CISEN outperforms the other models, achieving the highest WMAP@5 scores, 88.47% and 87.28%, on the two third-level ITR tasks. Moreover, CISEN improves WMAP@5 by approximately 1.3% and 0.9%, respectively, over its baseline. When tested on previous RS ITR benchmarks, CISEN achieves performance close to that of state-of-the-art methods, and pretraining on LuojiaHOG can further improve retrieval results. These findings underscore the advances CISEN makes in accurately retrieving relevant information across images and texts.
LuojiaHOG and CISEN can serve as foundational resources for future research on RS image–text alignment, supporting a broad spectrum of vision-language applications. The retrieval demo and dataset are available at: https://huggingface.co/spaces/aleo1/LuojiaHOG-demo.
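
For readers unfamiliar with how label-based retrieval scores of this kind are typically computed, the following is a minimal sketch, not the authors' CISEN model and not the paper's exact WMAP@5 definition (the abstract does not give one). It assumes image and text embeddings are already available as plain arrays, ranks gallery items by cosine similarity, and computes an unweighted mean average precision at k in which a retrieved item counts as relevant when its class label matches the query's. The function names, the toy vectors, and the choice to normalize by the number of hits inside the top k are all illustrative assumptions.

```python
import numpy as np


def rank_by_cosine(query_vecs, gallery_vecs):
    """Return, per query, gallery indices sorted by descending cosine similarity."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    g = gallery_vecs / np.linalg.norm(gallery_vecs, axis=1, keepdims=True)
    sims = q @ g.T  # shape: (num_queries, num_gallery)
    return np.argsort(-sims, axis=1, kind="stable")


def average_precision_at_k(ranked_labels, query_label, k=5):
    """Average precision over the top-k results (assumed convention:
    normalize by the number of relevant items found within the top k)."""
    hits = np.asarray(ranked_labels[:k]) == query_label
    if not hits.any():
        return 0.0
    precisions = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precisions * hits).sum() / hits.sum())


def map_at_k(ranking, query_labels, gallery_labels, k=5):
    """Unweighted mean of per-query average precision at k."""
    gallery_labels = np.asarray(gallery_labels)
    scores = [
        average_precision_at_k(gallery_labels[order], label, k)
        for order, label in zip(ranking, query_labels)
    ]
    return float(np.mean(scores))


# Toy text-to-image example with made-up 2-D embeddings and class labels.
text_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])
text_labels = [0, 1]
image_labels = [0, 1, 0]

ranking = rank_by_cosine(text_vecs, image_vecs)
score = map_at_k(ranking, text_labels, image_labels, k=5)
print(ranking.tolist(), score)
```

A weighted variant would replace the final `np.mean` with a weighted average (for example, by class frequency); which weights the paper uses is not stated in the abstract, so that step is left out here.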

Updated: 2025-02-27