Variational Neurons in Transformers for Language Modeling

arXiv:2603.28219v1. Transformers for language modeling usually rely on deterministic internal computation, with uncertainty expressed mainly at the output layer. We evaluate this design in compact next-token language-modeling settings. We compare deterministic and variational variants with both predictive and probabilistic criteria. Alongside negative log-likelihood, perplexity, and accuracy, we analyze calibration, conditional variance, mutual information, and latent-usage statistics. The resulting picture is clear.
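To make the predictive criteria concrete, here is a minimal sketch of how negative log-likelihood, perplexity, accuracy, and one common calibration measure could be computed from next-token probabilities. The abstract does not say how calibration is measured, so the use of expected calibration error (ECE) with equal-width confidence bins is an assumption. The function name `evaluate_next_token` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def evaluate_next_token(probs, targets, n_bins=10):
    """Score next-token predictions (illustrative helper, not the paper's code).

    probs:   list of probability vectors, one per position, each summing to 1.
    targets: list of gold token ids, one per position.
    n_bins:  number of equal-width confidence bins for ECE (an assumption).
    """
    n = len(targets)

    # Mean negative log-likelihood of the gold tokens, in nats.
    nll = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / n
    # Perplexity is the exponential of the mean NLL.
    perplexity = math.exp(nll)

    # Top-1 prediction and its probability (the model's confidence).
    preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
    confs = [p[k] for p, k in zip(probs, preds)]
    hits = [1 if k == t else 0 for k, t in zip(preds, targets)]
    accuracy = sum(hits) / n

    # ECE: bin predictions by confidence, then average the gap between
    # accuracy and mean confidence per bin, weighted by bin size.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for c, h in zip(confs, hits):
        b = min(int(c * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += c
        hit_sums[b] += h
    ece = sum(
        abs(hit_sums[b] / counts[b] - conf_sums[b] / counts[b]) * counts[b] / n
        for b in range(n_bins)
        if counts[b] > 0
    )

    return {"nll": nll, "perplexity": perplexity, "accuracy": accuracy, "ece": ece}


# Toy usage: a model that is uniform over a 4-token vocabulary.
uniform = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
metrics = evaluate_next_token(uniform, targets=[0, 1])
print(metrics)
```

For the uniform toy model the perplexity equals the vocabulary size (4, up to floating-point rounding), which is a quick sanity check. The same function could be applied to a deterministic and a variational variant on the same held-out tokens to compare them on identical criteria.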