Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error

arXiv:2510.26109v4 Announce Type: replace Reinforcement learning with verifiable rewards (RLVR) has significantly boosted the reasoning capability of language models (LMs). However, existing RLVR approaches train LMs on their own on-policy responses and are therefore constrained by the LMs' initial capability. This makes them prone to exploration stagnation, in which LMs fail to solve