(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL
LessWrong AI
Authors: Satvik Golechha*, Sid Black*, Joseph Bloom

* Equal contribution. This work was done as part of the Model Transparency team at the UK AI Security Institute (AISI).

Executive Summary

In Natural Emergent Misalignment from Reward Hacking in Production RL (MacDiarmid, 2025), Anthropic recently demonstrated that language models trained with RL on coding tasks can discover reward hacks, and that models which do so subsequently exhibit misaligned behaviour on unrelated evaluations:

Figure 0: The experimental pipeline from MacDiarmid that we reproduce.