AI RESEARCH

When and Why Grouping Attention Heads Accelerates Muon Optimization

arXiv CS.LG

arXiv:2605.08933v1 Announce Type: new Muon orthogonalizes matrix updates, but multi-head attention naturally operates at the level of heads. This granularity mismatch raises the question of whether Muon should be applied to the full attention projection, to individual heads, or to intermediate head groups. We study this question through a one-step descent comparison between full-matrix Muon and group-wise Muon.
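
To make the three granularities concrete, here is a minimal NumPy sketch, not the paper's implementation. It assumes that heads are stacked along the rows of the projection matrix, and it uses an exact SVD polar factor as a stand-in for the approximate Newton-Schulz iteration that Muon uses in practice. The function names, shapes, and the random stand-in gradient are all hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return the polar factor U @ Vt of m.

    Exact SVD stand-in for the approximate orthogonalization used by Muon.
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def groupwise_orthogonalize(m, num_heads, group_size):
    """Orthogonalize each block of `group_size` heads separately.

    Assumption: heads are stacked along axis 0 (rows) of m.
    group_size == num_heads gives full-matrix orthogonalization,
    group_size == 1 gives per-head orthogonalization.
    """
    if num_heads % group_size != 0:
        raise ValueError("group_size must divide num_heads")
    if m.shape[0] % num_heads != 0:
        raise ValueError("row count must be a multiple of num_heads")
    num_groups = num_heads // group_size
    blocks = np.split(m, num_groups, axis=0)
    return np.concatenate([orthogonalize(b) for b in blocks], axis=0)


# Hypothetical sizes and a random stand-in for a gradient or momentum matrix.
rng = np.random.default_rng(0)
num_heads, head_dim, d_model = 8, 16, 128
grad = rng.standard_normal((num_heads * head_dim, d_model))

full = groupwise_orthogonalize(grad, num_heads, group_size=num_heads)
grouped = groupwise_orthogonalize(grad, num_heads, group_size=2)
per_head = groupwise_orthogonalize(grad, num_heads, group_size=1)
```

All three updates have the same shape as `grad`. They differ in which blocks have orthonormal rows: the whole matrix, each two-head group, or each individual head.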