AI RESEARCH

Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

arXiv CS.CL

arXiv:2604.25409v1 Announce Type: new Probabilistic Transformer (PT), a white-box probabilistic model for contextual word representation, has demonstrated substantial similarity to standard Transformers in both computational structure and downstream task performance on small models and small- to medium-sized datasets. However, PT is less robust to hyperparameter choices than standard Transformers, making it harder to scale efficiently.