Incentivizing Strong Reasoning from Weak Supervision

arXiv:2505.20072v3 Announce Type: replace-cross Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning.