AI SAFETY & ETHICS

Confusion around the term reward hacking

LessWrong AI

Summary: "Reward hacking" commonly refers to two different phenomena: misspecified-reward exploitation, where RL reinforces undesired behaviors that score highly under the reward function, and task gaming, where models cheat on tasks specified to them in-context. While these often coincide, they can come apart, require distinct interventions, and lead to distinct threat models. Using a blanket term for both can obscure this.

Distinct phenomena qualify as reward hacking

The term commonly points to two distinct phenomena. I refer to them as misspecified-reward exploitation and task gaming.
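
To make the distinction concrete, here is a minimal Python sketch of how the two labels can apply independently to a single episode. Everything in it is a hypothetical illustration rather than anything from an actual training setup: the `Episode` fields, the `high_reward` threshold, and the `labels` helper are all assumed names, and the three example episodes are invented.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Episode:
    # Hypothetical record of one model episode (all fields are assumptions).
    reward: Optional[float]   # None when no RL reward is computed, e.g. in deployment
    behavior_desired: bool    # did the developers actually want this behavior?
    cheated_on_task: bool     # did the model cheat on the task given in-context?


def labels(ep: Episode, high_reward: float = 0.5) -> Set[str]:
    """Return which of the two phenomena apply to an episode."""
    out = set()
    # Misspecified-reward exploitation: undesired behavior that scores
    # highly under the reward function, so RL would reinforce it.
    if ep.reward is not None and ep.reward >= high_reward and not ep.behavior_desired:
        out.add("misspecified-reward exploitation")
    # Task gaming: cheating on the in-context task, whether or not
    # any reward signal is involved.
    if ep.cheated_on_task:
        out.add("task gaming")
    return out


# The two often coincide: cheating in an RL environment that rewards it.
both = Episode(reward=1.0, behavior_desired=False, cheated_on_task=True)
# High-scoring undesired behavior with no in-context task being cheated.
exploitation_only = Episode(reward=0.9, behavior_desired=False, cheated_on_task=False)
# Cheating on an in-context task where no reward is computed at all.
gaming_only = Episode(reward=None, behavior_desired=False, cheated_on_task=True)

for name, ep in [("both", both),
                 ("exploitation_only", exploitation_only),
                 ("gaming_only", gaming_only)]:
    print(name, sorted(labels(ep)))
```

The two conditions are checked independently on purpose: neither implies the other, which is why a single blanket label can hide which one is actually present.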