Probing RLVR training instability through the lens of objective-level hacking

arXiv:2602.01103v2 Announce Type: replace Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the