AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

arXiv:2603.01914v2 Announce Type: replace

Test-time scaling via recurrent (iterative) Transformers lets large language models spend additional computation at inference. However, most pretrained recurrent LMs run a fixed number of iterations for every token, which wastes compute on easy tokens and provides no token-wise adaptivity. Following the core ideas of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pre.
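To make the idea of token-wise early exit in an iterative model concrete, the following is a minimal toy sketch. It is not the AdaPonderLM architecture or training procedure, which this excerpt does not specify. The names `ponder`, `toy_step`, `toy_gate`, `max_iters`, and `threshold` are all hypothetical, and the halving step and norm-based gate are stand-ins for a learned recurrent block and a learned halting gate.

```python
import math

def ponder(states, step_fn, gate_fn, max_iters=4, threshold=0.5):
    """Apply step_fn to each token state up to max_iters times.

    A token stops iterating (exits early) as soon as gate_fn on its
    current state exceeds threshold. Returns the final states and the
    number of iterations each token actually used.
    """
    states = [list(s) for s in states]      # copy so the input is not mutated
    active = [True] * len(states)           # tokens that are still iterating
    iters_used = [0] * len(states)
    for _ in range(max_iters):
        if not any(active):
            break                           # every token has exited
        for i, is_active in enumerate(active):
            if not is_active:
                continue                    # frozen token: no further compute
            states[i] = step_fn(states[i])
            iters_used[i] += 1
            if gate_fn(states[i]) > threshold:
                active[i] = False
    return states, iters_used

# Hypothetical stand-in for one shared recurrent block: halve the state.
def toy_step(state):
    return [0.5 * x for x in state]

# Hypothetical stand-in for a halting gate: a sigmoid that exceeds 0.5
# exactly when the state's Euclidean norm drops below 1.
def toy_gate(state):
    norm = math.sqrt(sum(x * x for x in state))
    return 1.0 / (1.0 + math.exp(norm - 1.0))

tokens = [[0.5, 0.0], [3.0, 0.0], [100.0, 0.0]]
final_states, iters_used = ponder(tokens, toy_step, toy_gate)
print(iters_used)  # [1, 2, 4]
```

In this toy run the "easy" first token exits after one iteration, the second after two, and the third never triggers the gate and uses the full budget of four, which is the kind of per-token compute variation a fixed-iteration model cannot express.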