AI RESEARCH
Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers
arXiv CS.LG
arXiv:2604.25891v1 Announce Type: new Finetuning a language model can lead to emergent misalignment (EM) [Betley, 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to egregious behaviors when tested outside the