Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR

arXiv:2603.07084v1 Announce Type: new Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring how often reward hacking occurs remains challenging because true task rewards are often expensive or impossible to compute. We