Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

arXiv:2508.00410v3 Announce Type: replace While reinforcement learning with verifiable rewards (RLVR) is effective at improving the reasoning ability of large language models (LLMs), its reliance on human-annotated labels leads to a scaling-up dilemma, especially for complex tasks. Recent self-rewarding methods investigate a label-free alternative to unlock the reasoning capabilities of LLMs, yet they frequently encounter the non-negligible