VSmTrans: A hybrid paradigm integrating self-attention and convolution for 3D medical image segmentation
Medical Image Analysis (IF 10.7), Pub Date: 2024-08-24, DOI: 10.1016/j.media.2024.103295
Tiange Liu 1, Qingze Bai 2, Drew A Torigian 3, Yubing Tong 3, Jayaram K Udupa 3
Vision Transformers have recently achieved performance competitive with CNNs, owing to their excellent capability to learn global representations. However, there are two major challenges when applying them to 3D image segmentation: i) Because 3D medical images are large, comprehensive global information is hard to capture given the enormous computational cost. ii) Insufficient local inductive bias in Transformers impairs the ability to segment detailed features such as ambiguous and subtly defined boundaries. Hence, to apply the Vision Transformer mechanism to medical image segmentation, these challenges need to be adequately overcome. We propose a hybrid paradigm, called Variable-Shape Mixed Transformer (VSmTrans), that integrates self-attention and convolution and benefits both from the free learning of complex relationships provided by the self-attention mechanism and from the local prior knowledge provided by convolution. Specifically, we designed a Variable-Shape self-attention mechanism, which can rapidly expand the receptive field without extra computing cost and achieves a good trade-off between global awareness and local details. In addition, the parallel convolution paradigm introduces a strong local inductive bias that improves the ability to extract details. Meanwhile, a pair of learnable parameters automatically adjusts the importance of the two paradigms. Extensive experiments were conducted on two public medical image datasets with different modalities: the AMOS CT dataset and the BraTS2021 MRI dataset. Our method achieves the best average Dice scores of 88.3% and 89.7% on these datasets, respectively, which are superior to those of the previous state-of-the-art Swin Transformer-based and CNN-based architectures. A series of ablation experiments was also conducted to verify the efficiency of the proposed hybrid mechanism and its components and to explore the effectiveness of the key parameters in VSmTrans.
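For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a per-label Dice score and its average over labels are commonly computed. It is not taken from the VSmTrans code base: the function names `dice_score` and `mean_dice`, the flat-list input format, and the convention of returning 1.0 when a label is absent from both prediction and ground truth are assumptions made for illustration.

```python
def dice_score(pred, target, label):
    """Dice coefficient for one label over two equally sized flat label sequences.

    Dice = 2 * |P ∩ T| / (|P| + |T|), where P and T are the sets of voxels
    assigned to `label` in the prediction and the ground truth.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    pred_count = sum(1 for p in pred if p == label)
    target_count = sum(1 for t in target if t == label)
    overlap = sum(1 for p, t in zip(pred, target) if p == label and t == label)
    denom = pred_count + target_count
    if denom == 0:
        # Assumed convention: a label missing from both volumes counts as perfect.
        return 1.0
    return 2.0 * overlap / denom


def mean_dice(pred, target, labels):
    """Average the per-label Dice scores over the given foreground labels."""
    return sum(dice_score(pred, target, label) for label in labels) / len(labels)


# Toy example: six voxels, background 0 and two foreground labels.
example_pred = [0, 1, 1, 2, 2, 2]
example_target = [0, 1, 2, 2, 2, 2]
example_mean = mean_dice(example_pred, example_target, labels=[1, 2])
```

In the toy example, label 1 scores 2/3 and label 2 scores 6/7, so the average is 16/21. A real 3D volume would be flattened (or handled with array operations) before applying the same formula.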
The proposed hybrid Transformer-based backbone network for 3D medical image segmentation tightly integrates self-attention and convolution to exploit the advantages of both paradigms. The experimental results demonstrate our method's superiority over other state-of-the-art methods. The hybrid paradigm seems to be the most appropriate for the medical image segmentation field. The ablation experiments also demonstrate that the proposed hybrid mechanism can effectively balance large receptive fields with local inductive biases, resulting in highly accurate segmentation results, especially in capturing details. Our code is available at https://github.com/qingze-bai/VSmTrans.
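The idea of weighting a global self-attention branch against a parallel local convolution branch can be illustrated with a toy 1D sketch. Everything here is an assumption made for illustration: the abstract does not specify how the two branches are combined, so the weighted sum, the names `hybrid_block`, `alpha`, and `beta`, the scalar "tokens", and the fixed smoothing kernel are hypothetical and do not reproduce the authors' implementation. In a trained network `alpha` and `beta` would be learned; here they are passed in as plain numbers.

```python
import math


def global_attention_1d(x):
    """Toy single-head self-attention over scalars, with query = key = value = x.

    Every output position is a softmax-weighted average of all inputs,
    so each position can draw on the whole sequence.
    """
    out = []
    for q in x:
        scores = [q * k for k in x]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out


def local_conv_1d(x, kernel=(0.25, 0.5, 0.25)):
    """3-tap convolution with zero padding: each output sees only its neighbours."""
    padded = [0.0] + list(x) + [0.0]
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def hybrid_block(x, alpha, beta):
    """Assumed fusion rule: alpha * attention branch + beta * convolution branch."""
    attn = global_attention_1d(x)
    conv = local_conv_1d(x)
    return [alpha * a + beta * c for a, c in zip(attn, conv)]


signal = [0.0, 1.0, 0.0]
conv_only = hybrid_block(signal, alpha=0.0, beta=1.0)  # purely local response
attn_only = hybrid_block(signal, alpha=1.0, beta=0.0)  # purely global response
balanced = hybrid_block(signal, alpha=0.5, beta=0.5)   # equal mix of both
```

Setting `alpha` to zero recovers the purely local branch and setting `beta` to zero recovers the purely global one, which is the sense in which a pair of scalars can shift the balance between the two paradigms.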
Updated: 2024-08-24