Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

arXiv:2512.22473v4 Announce Type: replace-cross

Transformers empirically perform precise probabilistic reasoning in carefully constructed "Bayesian wind tunnels" and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy