AI RESEARCH
Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates
arXiv cs.LG
arXiv:2605.17787v1 (announce type: new)
It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-