Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization

ArXi:2510.04988v3 Announce Type: replace The vast majority of modern deep learning models are trained with momentum-based first-order optimizers. The momentum term governs the optimizer's memory by determining how much each past gradient contributes to the current convergence direction. Fundamental momentum methods, such as Nestero Accelerated Gradient and the Heavy Ball method, as well as recent optimizers such as AdamW and Lion, all rely on the momentum coefficient that is customarily set to $\beta = 0.9$ and kept constant during model.