Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

arXiv:2604.26173v1 Announce Type: cross An effective way to scale up the test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires