AI RESEARCH

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

arXiv cs.LG

arXiv:2605.17787v1 Announce Type: new It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-