AI RESEARCH

Noisy Data is Destructive to Reinforcement Learning with Verifiable Rewards

arXiv CS.LG

arXiv:2603.16140v1 Announce Type: new

Reinforcement learning with verifiable rewards (RLVR) has driven recent capability advances in large language models across various domains. Recent studies suggest that improved RLVR algorithms allow models to learn effectively from incorrect annotations, achieving performance comparable to learning from clean data. In this work, we show that these findings are invalid because the claimed 100% noisy