Uncertainty-Aware Reward Discounting for Mitigating Reward Hacking

arXiv:2604.26360v1 Announce Type: cross

Reinforcement learning (RL) systems typically optimize scalar reward functions, which presuppose that outcomes can be evaluated precisely and reliably. However, real-world objectives, especially those derived from human preferences, are often uncertain, context-dependent, and internally inconsistent. This mismatch can lead to alignment failures such as reward hacking, over-optimization, and overconfident behavior.