Half the Nonlinearity Is Wasted: Measuring and Reallocating the Transformer's MLP Budget

arXiv:2603.03459v2

We investigate when transformer MLP nonlinearity is actually necessary. A gate with $d+1$ parameters decides, per token, when to replace the full MLP with a linear surrogate. Through systematic investigation across six models (162M to 2.8B parameters), two architectures, and three corpora, we establish that the need for nonlinearity cannot be predicted from token identity: the cross-corpus correlation is effectively zero ($r < 0.05$). The routing decision is fully contextual.
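To make the $d+1$ parameter count concrete, the following is a minimal sketch of one way such a gate could work, not the paper's implementation. It assumes the gate is a logistic unit on the $d$-dimensional hidden state (a weight vector of length $d$ plus a bias), that the surrogate is a single $d \times d$ matrix, and that the MLP uses ReLU; the names `gated_mlp`, `w_gate`, `b_gate`, `A_lin`, and the 0.5 threshold are hypothetical.

```python
import numpy as np

def gated_mlp(x, w_gate, b_gate, W_in, W_out, A_lin, threshold=0.5):
    """Route each token to the full MLP or to a linear surrogate.

    x: (n_tokens, d) hidden states. All parameter names are illustrative
    assumptions; the abstract only states that the gate has d+1 parameters.
    """
    # Logistic gate on the hidden state: d weights + 1 bias = d+1 parameters.
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # shape (n_tokens,)
    use_mlp = gate > threshold

    # Start with the cheap linear surrogate for every token.
    out = x @ A_lin
    # Overwrite only the tokens the gate sends to the full nonlinear MLP
    # (ReLU is assumed here purely for illustration).
    if use_mlp.any():
        hidden = np.maximum(x[use_mlp] @ W_in, 0.0)
        out[use_mlp] = hidden @ W_out
    return out, use_mlp

# Toy usage with random weights (no trained model is implied).
rng = np.random.default_rng(0)
d, d_hidden, n = 8, 32, 5
x = rng.normal(size=(n, d))
w_gate, b_gate = rng.normal(size=d), 0.0
W_in = rng.normal(size=(d, d_hidden))
W_out = rng.normal(size=(d_hidden, d))
A_lin = rng.normal(size=(d, d))

out, use_mlp = gated_mlp(x, w_gate, b_gate, W_in, W_out, A_lin)
gate_param_count = w_gate.size + 1  # equals d + 1
```

Because the gate reads the hidden state rather than the token ID, the same token can be routed differently in different contexts, which is the sense in which the routing decision is contextual.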