SAGE: Shaping Anchors for Guided Exploration in RLVR of LLMs

arXiv:2605.18864v1 Announce Type: cross Recent studies observe that reinforcement learning with verifiable rewards (RLVR) reliably improves pass@1 on reasoning tasks, yet often fails to yield comparable gains in pass@k, raising the question of whether RLVR genuinely enables large language models to acquire novel reasoning abilities or merely enhances the efficiency of sampling reasoning modes already present in the base model. Prior analyses largely support the latter view, attributing this limitation to structural properties of standard RLVR objectives that result in insufficient exploration pressure.
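
The contrast between single-sample and multi-sample accuracy drawn above is usually quantified with the pass@k metric: the probability that at least one of k sampled solutions is verified correct. The sketch below is not taken from the paper; it shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The function name `pass_at_k` and the example counts are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Illustrative helper (hypothetical name), not the paper's code.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets that contain no correct sample;
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Assumed example: 16 samples for one problem, 2 of them verified correct.
p1 = pass_at_k(16, 2, 1)   # single-sample accuracy
p8 = pass_at_k(16, 2, 8)   # chance that at least one of 8 samples is correct
```

A policy can raise pass@1 by concentrating probability on solutions it already finds occasionally, without raising pass@k at large k. That is the pattern the abstract describes as more efficient sampling rather than new reasoning ability.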