High-dimensional limit theorems for SGD: Effective dynamics and critical scaling
Communications on Pure and Applied Mathematics (IF 3.1), Pub Date: 2023-10-04, DOI: 10.1002/cpa.22169
Authors: Gérard Ben Arous, Reza Gheissari, Aukosh Jagannath
We study the scaling limits of stochastic gradient descent (SGD) with constant step-size in the high-dimensional regime. We prove limit theorems for the trajectories of summary statistics (i.e., finite-dimensional functions) of SGD as the dimension goes to infinity. Our approach allows one to choose the summary statistics that are tracked, the initialization, and the step-size. It yields both ballistic (ODE) and diffusive (SDE) limits, with the limit depending dramatically on the former choices. We show a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss, but at which, a new correction term appears which changes the phase diagram. About the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate. We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. At the same time, we demonstrate the benefit of overparametrization by showing that the latter probability goes to zero as the second layer width grows.
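The idea of tracking a summary statistic of constant-step-size SGD as the dimension grows can be illustrated with a small simulation. The sketch below is not one of the paper's examples: it uses online SGD on a toy linear regression, with a hypothetical hidden vector `v`, a step-size `delta = c / n` (taking the step-size proportional to 1/n is an assumption of this sketch), and the single summary statistic u = |x - v|^2. For this toy model a short calculation suggests that, in rescaled time t = k/n, u approximately follows du/dt = -(2c - c^2) u + c^2 sigma^2; the 2c term is what gradient flow on the population loss would give, and the c^2 terms are a step-size-dependent correction. The function names `online_sgd_error` and `ode_prediction` are illustrative only.

```python
import numpy as np


def online_sgd_error(n, c, T, sigma=0.0, seed=0):
    """Online SGD on a toy linear regression; returns |x - v|^2 after each step.

    Each step draws a fresh sample (a, y) with a ~ N(0, I_n) and
    y = <a, v> + sigma * noise, then takes one gradient step on
    0.5 * (<a, x> - y)^2 with constant step-size delta = c / n.
    """
    rng = np.random.default_rng(seed)
    v = np.zeros(n)
    v[0] = 1.0                      # hypothetical hidden unit vector
    x = np.zeros(n)                 # initialization, so |x - v|^2 = 1
    delta = c / n                   # assumed step-size scaling for this sketch
    steps = int(T * n)              # rescaled time t = k / n runs up to T
    u = np.empty(steps + 1)
    u[0] = np.sum((x - v) ** 2)
    for k in range(steps):
        a = rng.standard_normal(n)
        y = a @ v + sigma * rng.standard_normal()
        x -= delta * (a @ x - y) * a            # one SGD step on a fresh sample
        u[k + 1] = np.sum((x - v) ** 2)         # summary statistic
    return u


def ode_prediction(c, t, sigma=0.0, u0=1.0):
    """Solution of du/dt = -(2c - c^2) u + c^2 sigma^2 with u(0) = u0, for 0 < c < 2."""
    lam = 2.0 * c - c ** 2
    u_star = c ** 2 * sigma ** 2 / lam          # fixed point of the toy ODE
    return u_star + (u0 - u_star) * np.exp(-lam * t)


if __name__ == "__main__":
    n, c, T = 2000, 1.0, 2.0
    u = online_sgd_error(n, c, T)
    print("simulated |x - v|^2 at time T:", u[-1])
    print("toy ODE prediction at time T: ", ode_prediction(c, T))
```

In this toy setting the correction term slows the decay (rate 2c - c^2 instead of 2c) and, when sigma > 0, leaves a nonzero limiting error c sigma^2 / (2 - c). This is only meant to make the notion of an effective dynamics for a summary statistic concrete; the paper's models and theorems are considerably more general.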
Updated: 2023-10-04