Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network
IEEE Transactions on Geoscience and Remote Sensing (IF 7.5) · Pub Date: 2024-07-01 · DOI: 10.1109/tgrs.2024.3421651
Hua Zhang, Hu Dou, Zelang Miao, Nanshan Zheng, Ming Hao, Wenzhong Shi

Automatic extraction of building footprints from images is one of the vital means of obtaining building footprint data. However, owing to the varied appearances, scales, and intricate structures of buildings, this task remains challenging. Recently, the vision transformer (ViT) has shown significant promise in semantic segmentation, thanks to its efficiency in capturing long-range dependencies. This article employs the ViT for extracting building footprints. Yet ViT-based models often face two limitations: extensive computational costs and insufficient preservation of local details during feature extraction. To address these challenges, a network based on an enhanced ViT (EViT) is proposed. In this network, one convolutional neural network (CNN)-based branch is introduced to extract comprehensive spatial details. Another branch, consisting of several multiscale enhanced ViT (EV) blocks, is developed to capture global dependencies. Subsequently, a multiscale and enhanced boundary feature extraction block is developed to fuse the global dependencies with the local details and to enhance boundary features, thereby yielding multiscale global-local contextual information with enhanced boundary features. Specifically, we present a window-based cascaded multihead self-attention (W-CMSA) mechanism with linear complexity in the window size, which not only reduces computational costs but also enhances attention diversity. EViT has been comprehensively evaluated against other state-of-the-art (SOTA) approaches on three benchmark datasets. The results show that EViT performs well in extracting building footprints and surpasses SOTA approaches, achieving 82.45%, 91.76%, and 77.14% IoU on the SpaceNet, WHU, and Massachusetts datasets, respectively. The implementation of EViT is available at https://github.com/dh609/EViT.
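The abstract's most concrete mechanism is the W-CMSA attention, but no implementation details are given here. The PyTorch sketch below illustrates only one plausible reading of "window-based cascaded multihead self-attention": attention restricted to non-overlapping local windows (as in Swin-style models) combined with sequentially evaluated heads, where each head's output feeds into the next head's input (similar in spirit to cascaded group attention). The class name `WindowCascadedMSA`, the per-head cascade, and all hyperparameters are assumptions for illustration, not the authors' code; consult the linked repository for the actual implementation.

```python
import torch
import torch.nn as nn


class WindowCascadedMSA(nn.Module):
    """Hypothetical sketch of window-based cascaded multi-head
    self-attention (W-CMSA). Attention is computed inside local
    windows, and heads run sequentially so each head refines the
    previous head's output, which is one common way to increase
    attention diversity at modest cost."""

    def __init__(self, dim: int, num_heads: int = 4, window_size: int = 7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.window_size = window_size
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        # one qkv projection per head so the heads can form a cascade
        self.qkv = nn.ModuleList(
            [nn.Linear(head_dim, head_dim * 3) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C); H and W assumed divisible by window_size
        B, H, W, C = x.shape
        ws = self.window_size
        # partition into non-overlapping windows -> (B * nWindows, ws*ws, C)
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        # split channels across heads and evaluate the heads as a cascade
        chunks = x.chunk(self.num_heads, dim=-1)
        outputs, carry = [], 0
        for head_qkv, chunk in zip(self.qkv, chunks):
            inp = chunk + carry  # feed the previous head's output forward
            q, k, v = head_qkv(inp).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            carry = attn.softmax(dim=-1) @ v
            outputs.append(carry)
        x = self.proj(torch.cat(outputs, dim=-1))

        # merge windows back to (B, H, W, C)
        x = x.view(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


# usage example with assumed shapes
attn = WindowCascadedMSA(dim=64, num_heads=4, window_size=7)
feat = torch.randn(2, 56, 56, 64)
out = attn(feat)  # (2, 56, 56, 64)
```

Note the cost structure this design targets: because attention never crosses window boundaries, total cost grows linearly with the number of windows (i.e., with image size at fixed window size) rather than quadratically with the full token count, and the cascade reuses each head's output instead of widening the attention. Whether this matches the paper's exact complexity claim for W-CMSA should be verified against the published method.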

Updated: 2024-08-19