Homogeneous tokenizer matters: Homogeneous visual tokenizer for remote sensing image understanding
ISPRS Journal of Photogrammetry and Remote Sensing (IF 10.6), Pub Date: 2024-09-21, DOI: 10.1016/j.isprsjprs.2024.09.009
Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li

On the basis of the transformer architecture and the pretext task of “next-token prediction”, multimodal large language models (MLLMs) are revolutionizing the paradigm in the field of remote sensing image understanding. However, the tokenizer, as one of the fundamental components of MLLMs, has long been overlooked or even misunderstood in visual tasks. A key factor behind the strong comprehension ability of large language models is that natural language tokenizers use meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as the basic elements of vision. Analogous to words or subwords in language, we define semantically independent regions (SIRs) for vision and then propose two properties that an ideal visual tokenizer should possess: (1) homogeneity, where SIRs serve as the basic elements of vision, and (2) adaptivity, which allows a flexible number of tokens to accommodate images of any size and tasks of any granularity. On this basis, we design a simple HOmogeneous visual tOKenizer: HOOK. HOOK consists of two modules: an object perception module (OPM) and an object vectorization module (OVM). To achieve homogeneity, the OPM splits the image into 4 × 4 pixel seeds and then uses a self-attention mechanism to identify SIRs. The OVM employs cross-attention to merge seeds within the same SIR. To achieve adaptivity, the OVM predefines a variable number of learnable vectors as cross-attention queries, allowing the token quantity to be adjusted. We conducted experiments on the NWPU-RESISC45, WHU-RS19, and NaSC-TG2 classification datasets for sparse tasks and on the GID5 and DGLCC segmentation datasets for dense tasks. The results show that the visual tokens obtained by HOOK correspond to individual objects, thereby verifying their homogeneity.
Compared with randomly initialized or pretrained Patch Embed, which required more than one hundred tokens per image, HOOK required only 6 and 8 tokens for sparse and dense tasks, respectively, resulting in performance improvements of 2% to 10% and efficiency improvements of 1.5 to 2.8 times. The homogeneity and adaptability of the proposed approach provide new perspectives for the study of visual tokenizers. Guided by these principles, the developed HOOK has the potential to replace traditional Patch Embed. The code is available at https://github.com/GeoX-Lab/Hook.
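The two-stage pipeline described above — seed embedding, self-attention among seeds (the OPM's mechanism for grouping seeds into SIRs), and cross-attention from a small, adjustable set of learnable queries (the OVM) — can be sketched as follows. This is a minimal, untrained NumPy illustration of the data flow, not the authors' implementation: the function name `hook_tokenize`, the embedding dimension, and the random projections are placeholders, and the single self-attention pass here only stands in for the paper's SIR-identification procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (Nq, d) x (Nk, d) -> (Nq, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def hook_tokenize(image, num_tokens, dim=16, seed_size=4, rng=None):
    """Sketch of a HOOK-style tokenizer: OPM then OVM (illustrative only)."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = image.shape

    # OPM step 1: split the image into seed_size x seed_size pixel "seeds"
    seeds = image.reshape(H // seed_size, seed_size, W // seed_size, seed_size, C)
    seeds = seeds.transpose(0, 2, 1, 3, 4).reshape(-1, seed_size * seed_size * C)

    # Embed each seed with a (random, stand-in) linear projection
    w_embed = rng.standard_normal((seeds.shape[1], dim)) * 0.02
    x = seeds @ w_embed

    # OPM step 2: self-attention among seeds, standing in for SIR grouping
    x = x + attention(x, x, x)

    # OVM: num_tokens learnable queries cross-attend to the seeds, so the
    # output token count is decoupled from image size (adaptivity)
    queries = rng.standard_normal((num_tokens, dim)) * 0.02
    tokens = attention(queries, x, x)
    return tokens
```

For a 32 × 32 RGB image and `num_tokens=6` (the sparse-task setting reported above), this returns a `(6, dim)` token matrix; the same image yields `(8, dim)` with `num_tokens=8`, which is the key contrast with Patch Embed, whose token count is fixed by the patch grid.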

Updated: 2024-09-21