Self-Aligned Reward: Towards Effective and Efficient Reasoners

arXiv:2509.05489v2 Announce Type: replace Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often leads to inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions tend to compromise accuracy. To address this, we