Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-11-14, DOI: 10.1007/s40747-024-01650-6
Authors: Tianping Li, Xiaolong Yang, Zhenyi Zhang, Zhaotong Cui, Zhou Maoxia
Many vision transformer models for semantic segmentation have been proposed recently, and most achieve impressive results. However, they make little use of the intrinsic position and channel features of the image and are weak at multi-scale feature fusion. This paper presents a semantic segmentation method that combines attention with multi-scale representation to improve both performance and efficiency. The proposed mix-layers semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) consists of two components: a mix-layers dual channel semantic extraction module (MDCE) and a semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a multi-layers cross attention module (MCAM) and an efficient channel attention module (ECAM). In MCAM, the horizontal (lateral) connections between encoder and decoder stages serve as the feature queries of the attention module, while the hierarchical feature maps from the different encoder and decoder stages are integrated into the keys and values. To capture long-range dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. Pyramid pooling reduces the size of the feature maps, which lowers the amount of computation without compromising performance. SAPPM comprises several distinct pooling kernels that extract context with a deeper flow of information, and it forms a multi-scale feature by integrating features of various sizes. MEMAFormer-B0 outperforms SegFormer-B0, with gains of 4.8%, 4.0% and 3.5% on the ADE20K, Cityscapes and COCO-stuff datasets, respectively.
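The abstract describes the attention flow only in words, so a small NumPy sketch may help make the tensor shapes concrete. It is an illustration under stated assumptions, not the authors' implementation: the function names, the pooling sizes (1, 2, 4), the absence of learned projections, and the choice of a channel-affinity form for the channel attention are all hypothetical, and every stage is assumed to have already been projected to the same channel width C.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def avg_pool_to(feat, out_size):
    # Average-pool a (C, H, W) map to (C, out_size, out_size).
    # Assumes H and W are divisible by out_size.
    c, h, w = feat.shape
    blocks = feat.reshape(c, out_size, h // out_size, out_size, w // out_size)
    return blocks.mean(axis=(2, 4))

def pyramid_tokens(stage_feats, pool_sizes=(1, 2, 4)):
    # Pool every stage at several sizes and flatten to a short token list.
    # This is what shrinks the key/value set and hence the attention cost.
    tokens = []
    for feat in stage_feats:
        for size in pool_sizes:
            pooled = avg_pool_to(feat, size)                      # (C, size, size)
            tokens.append(pooled.reshape(pooled.shape[0], -1).T)  # (size*size, C)
    return np.concatenate(tokens, axis=0)                         # (N, C)

def cross_attention(query_map, kv_tokens):
    # Queries come from one (lateral) feature map; keys and values are the
    # pooled multi-stage tokens. Learned Q/K/V projections are omitted here.
    c, h, w = query_map.shape
    q = query_map.reshape(c, -1).T                                # (H*W, C)
    attn = softmax(q @ kv_tokens.T / np.sqrt(c), axis=-1)         # (H*W, N)
    out = attn @ kv_tokens                                        # (H*W, C)
    return out.T.reshape(c, h, w), attn

def channel_affinity_attention(feat):
    # Assumed stand-in for channel attention: a C x C affinity re-weights
    # each channel by the channels it is most similar to, with a residual.
    c, h, w = feat.shape
    x = feat.reshape(c, -1)                                       # (C, H*W)
    affinity = softmax(x @ x.T / np.sqrt(x.shape[1]), axis=-1)    # (C, C)
    return feat + (affinity @ x).reshape(c, h, w)

rng = np.random.default_rng(0)
stages = [rng.standard_normal((8, 16, 16)), rng.standard_normal((8, 8, 8))]
query = rng.standard_normal((8, 16, 16))

kv = pyramid_tokens(stages)
fused, attn = cross_attention(query, kv)
refined = channel_affinity_attention(fused)
print(kv.shape, fused.shape, attn.shape, refined.shape)
```

With these toy inputs the 256 query positions attend to 42 pooled tokens instead of the 320 positions in the unpooled stage maps, which is the kind of saving the pyramid pooling step is meant to provide.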
Title (translated from the Chinese listing): Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation