Grokking of Diffusion Models: Case Study on Modular Addition

arXiv:2604.17673v1. Despite their empirical success, how diffusion models generalize remains poorly understood from a mechanistic perspective. We demonstrate that diffusion models trained with flow-matching objectives exhibit grokking (delayed generalization after overfitting) on modular addition, which enables controlled analysis of their internal computations. We study this phenomenon in two data regimes. In a single-image regime, mechanistic dissection reveals that the model implements modular addition by composing periodic representations of the individual operands.
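
To make "composing periodic representations" concrete, the following is a minimal NumPy sketch of how modular addition can be computed from cosine and sine features of each operand. It illustrates the general idea only, not the circuit recovered from the trained model: the function names, the choice of frequencies, and the candidate-scoring readout are all assumptions introduced here. Each operand is embedded as (cos, sin) pairs at several frequencies, the angle-addition identities combine the two embeddings into features of a + b, and the answer is the residue whose own embedding matches best.

```python
import numpy as np


def periodic_embedding(x, p, freqs):
    """Return cos and sin features of x at each frequency, shape (K, N).

    Hypothetical helper: K = len(freqs), N = number of values in x.
    """
    angles = 2.0 * np.pi * np.outer(freqs, np.atleast_1d(x)) / p
    return np.cos(angles), np.sin(angles)


def mod_add_via_periodic(a, b, p, freqs=None):
    """Compute (a + b) mod p using only periodic features of a and b."""
    if freqs is None:
        # Assumption: use every nonzero frequency 1..p-1.
        freqs = np.arange(1, p)
    freqs = np.asarray(freqs)

    cos_a, sin_a = periodic_embedding(a, p, freqs)  # each (K, 1)
    cos_b, sin_b = periodic_embedding(b, p, freqs)  # each (K, 1)

    # Angle-addition identities give the features of a + b
    # without ever adding the integers themselves.
    cos_sum = cos_a * cos_b - sin_a * sin_b
    sin_sum = sin_a * cos_b + cos_a * sin_b

    # Score every candidate residue c by sum_k cos(2*pi*k*(a + b - c) / p).
    cos_c, sin_c = periodic_embedding(np.arange(p), p, freqs)  # each (K, p)
    logits = (cos_sum * cos_c + sin_sum * sin_c).sum(axis=0)   # shape (p,)

    # The score is largest exactly when c is congruent to a + b modulo p.
    return int(np.argmax(logits))


if __name__ == "__main__":
    p = 7
    for a in range(p):
        for b in range(p):
            assert mod_add_via_periodic(a, b, p) == (a + b) % p
    print("all", p * p, "pairs agree with (a + b) % p")
```

With all nonzero frequencies, the score equals p - 1 at the correct residue and -1 everywhere else, so the readout is unambiguous. A single frequency coprime to p also suffices in exact arithmetic, although the margin between the best and second-best candidate is then much smaller.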