SPINE: Token-Selective Test-Time Reinforcement Learning with Entropy-Band Regularization

arXiv:2511.17938v2 Announce Type: replace-cross

Large language models (LLMs) and multimodal LLMs (MLLMs) excel at chain-of-thought reasoning but face distribution shift at test time and a lack of verifiable supervision. Recent test-time reinforcement learning (TTRL) methods derive label-free pseudo-rewards from self-consistency voting over sampled trajectories, yet they often collapse: the majority-vote reward prevails, responses shorten, and Pass declines.
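To make the self-consistency voting idea concrete, here is a minimal sketch of how a label-free pseudo-reward could be derived from sampled trajectories. It illustrates the generic majority-vote scheme described above, not the method proposed in this paper. The function name `majority_vote_rewards`, the binary 1.0/0.0 reward, and the assumption that each trajectory has already been reduced to a final-answer string are all hypothetical choices made for illustration.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (pseudo_label, rewards) for a list of sampled final answers.

    Assumption: each trajectory has been reduced to a comparable final-answer
    string. The binary reward is an illustrative choice, not taken from the source.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # The most frequent answer serves as the pseudo-label (no ground truth needed).
    pseudo_label, _ = counts.most_common(1)[0]
    # Trajectories that agree with the pseudo-label get reward 1.0, others 0.0.
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards


# Hypothetical final answers extracted from five sampled trajectories.
sampled = ["42", "42", "17", "42", "8"]
label, rewards = majority_vote_rewards(sampled)
```

Because the reward depends only on agreement with the majority, a policy trained on it can drift toward whatever answer is already most common, which is one way to picture the collapse the abstract mentions.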