International Journal of Computer Vision (IF 11.6) Pub Date: 2024-09-15, DOI: 10.1007/s11263-024-02213-5. Zixin Wang, Yadan Luo, Liang Zheng, Zhuoxiao Chen, Sen Wang, Zi Huang
This article presents a comprehensive survey of online test-time adaptation (OTTA), which adapts machine learning models to distributionally shifted target data as each batch arrives. Despite the recent proliferation of OTTA methods, conclusions drawn in previous studies are inconsistent, owing to ambiguous experimental settings, outdated backbones, and inconsistent hyperparameter tuning; these obscure the core challenges and hinder reproducibility. To enhance clarity and enable rigorous comparison, we classify OTTA techniques into three primary categories and benchmark them on a modern backbone, the Vision Transformer. Our benchmarks cover conventional corruption datasets such as CIFAR-10/100-C and ImageNet-C, as well as real-world shifts represented by CIFAR-10.1, OfficeHome, and CIFAR-10-Warehouse. The CIFAR-10-Warehouse dataset comprises images collected from different search engines together with synthetic images generated by diffusion models. To measure efficiency in online scenarios, we introduce evaluation metrics including GFLOPs, wall-clock time, and GPU memory usage, providing a clearer picture of the trade-off between adaptation accuracy and computational overhead. Our findings diverge from the existing literature, revealing that (1) transformers exhibit heightened resilience to diverse domain shifts, (2) the efficacy of many OTTA methods relies on large batch sizes, and (3) optimization stability and resistance to perturbations are crucial during adaptation, particularly when the batch size is 1. Based on these insights, we highlight promising directions for future research. Our benchmarking toolkit and source code are available at https://github.com/Jo-wang/OTTA_ViT_survey.
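The online protocol the survey evaluates — predict on each target batch as it arrives, then update the model without labels, while recording per-batch cost — can be sketched as below. This is a minimal illustrative harness, not the paper's toolkit: `model`, `adapt_step`, and the timing bookkeeping are all assumptions, and a real OTTA method would update, e.g., normalization statistics or a small set of parameters inside `adapt_step`.

```python
import time

def run_otta(model, batches, adapt_step):
    """Minimal online test-time adaptation loop.

    For each arriving batch: predict first (the prediction the user sees),
    then let the method update the model from that unlabeled batch, and
    record the combined wall-clock cost per batch.
    """
    predictions, per_batch_seconds = [], []
    for batch in batches:
        start = time.perf_counter()
        preds = model(batch)                # inference on the incoming batch
        model = adapt_step(model, batch)    # unsupervised update (e.g. entropy minimization)
        per_batch_seconds.append(time.perf_counter() - start)
        predictions.append(preds)
    return predictions, per_batch_seconds

# Toy run: a stateless "model" (plain function) and a no-op adapt_step
# show the shape of the protocol.
preds, secs = run_otta(
    model=lambda b: [x * 2 for x in b],
    batches=[[1, 2], [3]],
    adapt_step=lambda m, b: m,
)
```

Because the batch size seen by `adapt_step` is exactly the deployment batch size, this loop also makes concrete why the survey's finding (2) matters: with batch size 1, batch-level statistics that many methods rely on degenerate to a single sample.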
In Search of Lost Online Test-Time Adaptation: A Survey