Slimmable Networks for Contrastive Self-supervised Learning
International Journal of Computer Vision (IF 11.6). Pub Date: 2024-09-26. DOI: 10.1007/s11263-024-02211-7. Shuai Zhao, Linchao Zhu, Xiaohan Wang, Yi Yang
Self-supervised learning has made significant progress in pre-training large models but struggles with small ones. Mainstream solutions rely on knowledge distillation, a two-stage procedure: first train a large teacher model, then distill it to improve the generalization of smaller models. In this work, we introduce a one-stage alternative that yields pre-trained small models without an extra teacher: slimmable networks for contrastive self-supervised learning (SlimCLR). A slimmable network consists of a full network and several weight-sharing sub-networks, which can be pre-trained once to obtain a family of networks, including small ones with low computational cost. However, interference between the weight-sharing networks causes severe performance degradation in the self-supervised setting, as evidenced by gradient magnitude imbalance and gradient direction divergence. The former means that a small proportion of parameters produces the dominant gradients during backpropagation while the main parameters may be under-optimized; the latter means that gradient directions are disordered and the optimization process is unstable. To address these issues, we introduce three techniques that make the main parameters produce the dominant gradients and keep sub-network outputs consistent: slow start training of sub-networks, online distillation, and loss re-weighting according to model size. Furthermore, we present theoretical results showing that a single slimmable linear layer is sub-optimal for linear evaluation, and therefore apply a switchable linear probe layer during linear evaluation. We instantiate SlimCLR with typical contrastive learning frameworks and achieve better performance than prior methods with fewer parameters and FLOPs. The code is available at https://github.com/mzhaoshuai/SlimCLR.
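To make the method description more concrete, below is a minimal PyTorch sketch of how the three training techniques could fit together. It is an illustrative assumption, not the authors' released implementation (see the repository linked above): the names SlimmableLinear, SlimmableEncoder, slimclr_step, width_mults, slow_start_epochs, and distill_temp are hypothetical, the encoder is a toy MLP rather than a full vision backbone, and the switchable linear probe used for linear evaluation is omitted.

```python
# Minimal sketch of the SlimCLR training ideas described in the abstract.
# All module and variable names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SlimmableLinear(nn.Linear):
    """Linear layer whose active width is chosen at call time by slicing shared weights."""

    def forward(self, x, width_mult=1.0):
        out_dim = int(self.out_features * width_mult)
        in_dim = min(x.shape[-1], self.in_features)
        return F.linear(x, self.weight[:out_dim, :in_dim], self.bias[:out_dim])


class SlimmableEncoder(nn.Module):
    """Toy two-layer encoder; sub-networks reuse the leading channels of the full network."""

    def __init__(self, in_dim=128, hidden_dim=256, embed_dim=128):
        super().__init__()
        self.fc1 = SlimmableLinear(in_dim, hidden_dim)
        self.fc2 = SlimmableLinear(hidden_dim, embed_dim)

    def forward(self, x, width_mult=1.0):
        h = F.relu(self.fc1(x, width_mult))
        # Keep the embedding dimension fixed so every width produces comparable features.
        z = self.fc2(h)
        return F.normalize(z, dim=-1)


def info_nce_logits(q, k, temperature=0.2):
    """Similarity of each query against all keys in the batch (matching index = positive)."""
    return q @ k.t() / temperature


def slimclr_step(encoder, x1, x2, epoch, width_mults=(1.0, 0.5, 0.25),
                 slow_start_epochs=5, distill_temp=1.0):
    """One training step combining the three techniques named in the abstract."""
    labels = torch.arange(x1.shape[0])

    # Full network: plain contrastive (InfoNCE) loss on two augmented views.
    q_full = encoder(x1, width_mult=1.0)
    k_full = encoder(x2, width_mult=1.0)
    logits_full = info_nce_logits(q_full, k_full)
    loss = F.cross_entropy(logits_full, labels)

    # Slow start: sub-networks join the optimization only after a warm-up period,
    # so the main (full-network) parameters produce the dominant gradients early on.
    if epoch >= slow_start_epochs:
        for w in width_mults:
            if w == 1.0:
                continue
            q_sub = encoder(x1, width_mult=w)
            logits_sub = info_nce_logits(q_sub, k_full.detach())
            # Online distillation: each sub-network mimics the full network's output
            # distribution over the same contrastive logits; no extra teacher is trained.
            distill = F.kl_div(
                F.log_softmax(logits_sub / distill_temp, dim=-1),
                F.softmax(logits_full.detach() / distill_temp, dim=-1),
                reduction="batchmean",
            )
            # Loss re-weighting according to model size: smaller widths get smaller weights.
            loss = loss + w * distill
    return loss


if __name__ == "__main__":
    encoder = SlimmableEncoder()
    x1, x2 = torch.randn(32, 128), torch.randn(32, 128)  # two augmented views (toy data)
    loss = slimclr_step(encoder, x1, x2, epoch=10)
    loss.backward()
    print(float(loss))
```

In this sketch, weight sharing comes from slicing the leading channels of the full network's weight matrices, so every width writes gradients into the same parameters; the slow-start condition and the width-proportional loss weights then bias those gradients toward the main parameters and encourage consistent outputs across sub-networks, in the spirit of the abstract.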