AI RESEARCH
Safety Training Modulates Harmful Misalignment Under On-Policy RL, But Direction Depends on Environment Design
arXiv cs.LG
arXiv:2604.12500v1 Announce Type: new
Specification gaming under reinforcement learning (RL) is known to cause LLMs to develop sycophantic, manipulative, or deceptive behavior, yet the conditions under which this occurs remain unclear. We train 11 instruction-tuned LLMs (0.5B to 14B parameters) with on-policy RL across 3 environments and find that model size acts as a safety buffer in some environments but enables greater harmful exploitation in others. Controlled ablations trace this reversal to environment-specific features such as role framing and implicit gameability cues.