Robust Optimization for Mitigating Reward Hacking with Correlated Proxies

arXiv:2604.12086v1 Announce Type: new Designing robust reinforcement learning (RL) agents in the presence of imperfect reward signals remains a core challenge. In practice, agents are often trained with proxy rewards that only approximate the true objective, leaving them vulnerable to reward hacking, where high proxy returns arise from unintended or exploitative behaviors.