ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models
Earth System Science Data (IF 11.2), Pub Date: 2024-06-27, DOI: 10.5194/essd-2024-140
Zhenghang Yuan, Zhitong Xiong, Lichao Mou, Xiao Xiang Zhu

Abstract. The rapid development of remote sensing technology has led to an exponential growth in satellite imagery, yet its inherent complexity often makes it difficult for non-expert users to understand. Natural language, as a carrier of human knowledge, can bridge the gap between common users and complicated satellite imagery. Moreover, when paired with visual data, natural language can be used to train large vision-language foundation models, significantly improving performance on various tasks. Despite these advances, the remote sensing community still lacks large-scale, high-quality vision-language datasets for satellite images. To address this gap, we introduce a new image-text dataset that provides high-quality natural language descriptions for global-scale satellite data. Specifically, we use Sentinel-2 data, chosen for its global coverage, as the foundational image source and employ semantic segmentation labels from the European Space Agency's WorldCover project to enrich the descriptions of land cover. Through in-depth semantic analysis, we formulate detailed prompts that elicit rich descriptions from ChatGPT. We then apply a manual verification process, inspecting and correcting the generated captions to further improve the dataset's quality. Finally, we offer the community ChatEarthNet, a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions. ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision). The dataset has significant potential for both training and evaluating vision-language geo-foundation models for remote sensing. The code is publicly available at https://doi.org/10.5281/zenodo.11004358 (Yuan et al., 2024b), and the ChatEarthNet dataset is available at https://doi.org/10.5281/zenodo.11003436 (Yuan et al., 2024c).
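To make the prompt-construction step described above more concrete, the following Python sketch illustrates one plausible way to turn a WorldCover-style segmentation patch into a land-cover summary and a caption-request prompt for a language model. It is a minimal illustration under stated assumptions, not the authors' actual pipeline: the class codes follow the ESA WorldCover legend, but the functions landcover_summary and build_prompt, the 1% reporting threshold, and the prompt wording are hypothetical choices made for this example.

# Hypothetical sketch of the caption-generation step: compute land-cover
# proportions from a WorldCover-style label patch and compose a prompt.
# Class codes follow the ESA WorldCover legend; everything else is illustrative.
import numpy as np

WORLDCOVER_CLASSES = {
    10: "tree cover", 20: "shrubland", 30: "grassland", 40: "cropland",
    50: "built-up", 60: "bare / sparse vegetation", 70: "snow and ice",
    80: "permanent water bodies", 90: "herbaceous wetland",
    95: "mangroves", 100: "moss and lichen",
}

def landcover_summary(label: np.ndarray) -> str:
    """Summarize a segmentation label patch as class-name: percentage pairs."""
    total = label.size
    parts = []
    for code, name in WORLDCOVER_CLASSES.items():
        frac = np.count_nonzero(label == code) / total
        if frac >= 0.01:  # skip negligible classes (illustrative threshold)
            parts.append(f"{name}: {frac:.0%}")
    return ", ".join(parts)

def build_prompt(label: np.ndarray) -> str:
    """Compose a prompt asking the language model for a detailed scene caption."""
    return (
        "You are describing a Sentinel-2 satellite image. "
        f"Its land-cover composition is: {landcover_summary(label)}. "
        "Write a detailed natural-language description of the scene, "
        "mentioning the dominant land-cover types and their spatial extent."
    )

if __name__ == "__main__":
    # Toy 4x4 label patch: mostly cropland with some built-up area and grassland.
    toy_label = np.array([[40, 40, 40, 50],
                          [40, 40, 50, 50],
                          [40, 40, 40, 40],
                          [40, 30, 40, 40]])
    print(build_prompt(toy_label))

In the paper's workflow, such a prompt would be sent to ChatGPT and the returned caption paired with the corresponding Sentinel-2 image; the manual verification stage then inspects and corrects the generated text.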

Last updated: 2024-06-27