Learning Target-Aware Vision Transformers for Real-Time UAV Tracking
IEEE Transactions on Geoscience and Remote Sensing (IF 7.5), Pub Date: 2024-06-21, DOI: 10.1109/tgrs.2024.3417400
Shuiwang Li, Xiangyang Yang, Xucheng Wang, Dan Zeng, Hengzhou Ye, Qijun Zhao

In recent years, the field of unmanned aerial vehicle (UAV) tracking has grown rapidly, finding numerous applications across various industries. While discriminative correlation filter (DCF)-based trackers remain the most efficient and widely used in UAV tracking, lightweight convolutional neural network (CNN)-based trackers using filter pruning have recently also demonstrated impressive efficiency and precision. However, the performance of these lightweight CNN-based trackers is still far from satisfactory. In generic visual tracking, emerging vision transformer (ViT)-based trackers have shown great success by using cross-attention instead of the correlation operation, enabling more effective capture of relationships between the target and the search image. However, to the best of the authors' knowledge, the UAV tracking community has not yet well explored the potential of ViTs for more effective and efficient template-search coupling in UAV tracking. In this article, we propose an efficient ViT-based tracking framework for real-time UAV tracking. Our framework integrates feature learning and template-search coupling into an efficient one-stream ViT, avoiding an extra heavy relation-modeling module. However, we observe that the target information tends to be weakened through the transformer blocks because background tokens significantly outnumber target tokens. To address this problem, we propose to maximize the mutual information (MI) between the template image and its feature representation produced by the ViT. The proposed method is dubbed TATrack. In addition, to further enhance efficiency, we introduce a novel MI maximization-based knowledge distillation, which strikes a better trade-off between accuracy and efficiency. Exhaustive experiments on five benchmarks show that the proposed tracker achieves state-of-the-art performance in UAV tracking. Code is released at: https://github.com/xyyang317/TATrack .
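The abstract does not specify which MI estimator TATrack uses, so the following is only a minimal, hypothetical sketch of one common way to maximize MI between two sets of embeddings: the InfoNCE lower bound, where matched (template, feature) pairs sit on the diagonal of a similarity matrix. All names (`info_nce_lower_bound`, `temperature`, the toy embeddings) are illustrative and not from the paper.

```python
import numpy as np

def info_nce_lower_bound(z_template, z_feature, temperature=0.1):
    """InfoNCE lower bound on MI between paired embeddings.

    Rows of z_template and z_feature with the same index are positive
    pairs; all other rows serve as negatives. Returns a scalar that is
    larger when paired embeddings are more predictive of each other.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = z_template / np.linalg.norm(z_template, axis=1, keepdims=True)
    b = z_feature / np.linalg.norm(z_feature, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (N, N); positives on the diagonal

    # Row-wise log-softmax (numerically stabilized).
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # I(X; Z) >= E[log p(positive)] + log N.
    return float(np.mean(np.diag(log_probs)) + np.log(len(a)))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
mi_matched = info_nce_lower_bound(z, z + 0.01 * rng.normal(size=(8, 32)))
mi_random = info_nce_lower_bound(z, rng.normal(size=(8, 32)))
# Matched pairs yield a higher MI estimate than unrelated pairs.
```

Training would minimize the negative of this bound, encouraging the ViT's features to retain the template (target) information that the abstract says is otherwise diluted by background tokens.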

Updated: 2024-08-19