
[P] Weight Norm Clipping Accelerates Grokking 18-66× | Zero Failures Across 300 Seeds | PDF in Repo

r/MachineLearning

Zero failures across 300 seeds. 66× speedup. 5 lines of code.

We're two independent researchers. The method: per-row ℓ₂ clipping on decoder weights after every optimizer step. No additional memory, no weight decay needed.

Results on the standard grokking benchmark (modular arithmetic, decoder-only transformer, same setup as Grokfast):

- 2-layer (422k params): 66× over AdamW baseline with Lion+Clip
- 8-layer (1.6M params): 18× over baseline, zero failures across 300 seeds, IQR reduction 61-72% with edge initialization

Honest scope: all experiments are modular arithmetic.
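For readers who want to see what per-row ℓ₂ clipping after each optimizer step might look like, here is a minimal PyTorch sketch. It is not the code from the repo. The function name `clip_rows_`, the `max_norm` threshold, the `eps` guard, and the toy `nn.Linear` model standing in for the decoder weights are all assumptions made for illustration. AdamW with `weight_decay=0.0` is used only because it ships with PyTorch; the post's best result uses Lion, which would be swapped in the same place.

```python
import torch


@torch.no_grad()
def clip_rows_(weight: torch.Tensor, max_norm: float = 1.0, eps: float = 1e-12) -> torch.Tensor:
    """Rescale, in place, every row of `weight` whose L2 norm exceeds `max_norm`.

    Rows already inside the ball are left untouched. `max_norm` and `eps`
    are illustrative values, not settings taken from the post.
    """
    row_norms = weight.norm(p=2, dim=1, keepdim=True)          # shape: (rows, 1)
    scale = (max_norm / (row_norms + eps)).clamp(max=1.0)      # <= 1 for every row
    weight.mul_(scale)
    return weight


# Toy training loop: a single linear layer stands in for the decoder weights.
torch.manual_seed(0)
model = torch.nn.Linear(8, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=0.0)
x, y = torch.randn(16, 8), torch.randn(16, 4)

for _ in range(5):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    # Clip after every optimizer step, as the method describes.
    clip_rows_(model.weight, max_norm=0.5)
```

The clip is a projection applied to the weights themselves, so it keeps no extra optimizer state, which is consistent with the "no additional memory" claim. Which parameter matrices count as "decoder weights", and what threshold to use, are details to take from the PDF in the repo rather than from this sketch.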