On the Origin of Implicit Regularization in SGD
Speaker
Samuel L. Smith, DeepMind
Time
2021-10-08, 16:00 ~ 17:00 (Asia/Shanghai Time)
Venue
Tencent Meeting & Zoom
Meeting Info:
Time: 2021-10-08, 16:00-17:00
Zoom info
Meeting ID: 992 9411 7901
Passcode: 123456
Join link:
https://nyu.zoom.us/j/99294117901?pwd=VGd2b3V0bDBiNm5laVA3N3FBc2tkdz09
Zoom download:
https://zoom.us/download
Tencent Meeting info (note: you cannot communicate with the speaker if using Tencent Meeting)
ID: 336 361 086
Password: 123456
Link:
https://meeting.tencent.com/dm/TZqvwCRPSQGi
Abstract
For vanishing learning rates, the SGD iterates follow the path of gradient flow on the training loss. However, a number of authors have observed that models trained with moderately large learning rates often achieve higher test accuracies, despite not achieving lower training losses. To explain this phenomenon, I will use backward error analysis to prove that, for finite learning rates, the iterates of random-shuffling SGD still follow the path of gradient flow, but on a modified loss surface [1,2]. This modified loss surface is composed of the explicit training loss and an implicit regularization term, which penalizes the trace of the empirical covariance matrix of the gradients. The scale of this implicit regularization term is proportional to the ratio of the learning rate to the batch size, thus explaining why finite learning rates and small batch sizes enhance generalization.
[1] Implicit Gradient Regularization, Barrett and Dherin, ICLR 2021
[2] On the Origin of Implicit Regularization in Stochastic Gradient Descent, Smith et al., ICLR 2021
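(A rough sketch of the modified loss described in the abstract; the notation below is assumed here rather than taken from the talk, and [1,2] should be consulted for the precise statement.) For random-shuffling SGD with learning rate \epsilon and m minibatches per epoch, the iterates approximately follow gradient flow on the modified loss

\tilde{C}_{\mathrm{SGD}}(\omega) \approx C(\omega) + \frac{\epsilon}{4m} \sum_{k=1}^{m} \left\| \nabla \hat{C}_k(\omega) \right\|^2 ,

where C(\omega) is the full-batch training loss and \hat{C}_k(\omega) is the loss on the k-th minibatch. Expanding the sum splits it into the squared norm of the full-batch gradient plus a fluctuation term proportional to the trace of the covariance of the minibatch gradients, which is where the learning-rate-to-batch-size scaling mentioned in the abstract arises.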
Bio
Sam is a Senior Research Scientist at DeepMind. His research seeks to build fundamental understanding in deep learning, with a particular focus on the roles of optimization and initialization. He previously completed a PhD in theoretical physics.