AI RESEARCH
Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR
arXiv CS.LG
•
ArXi:2603.07084v1 Announce Type: new Reward hacking is a form of misalignment in which models overoptimize proxy rewards without genuinely solving the underlying task. Precisely measuring reward hacking occurrence remains challenging because true task rewards are often expensive or impossible to compute. We