When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals

ArXi:2604.01476v1 Announce Type: new Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed.