AI RESEARCH
Probing RLVR training instability through the lens of objective-level hacking
arXiv CS.AI
arXiv:2602.01103v2 Announce Type: replace
Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the