Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers

arXiv:2605.07772v1 Announce Type: new Transformers perform inference by iteratively transforming token representations across layers. This layerwise computation has been studied empirically, and recent mean-field theories of Transformer dynamics explain how attention can drive token distributions toward clustering. However, existing mean-field analyses largely treat model parameters as prescribed, leaving open how