AI RESEARCH

Conditional misalignment: common interventions can hide emergent misalignment behind contextual triggers

arXiv cs.LG

arXiv:2604.25891v1 Announce Type: new Finetuning a language model can lead to emergent misalignment (EM) [Betley, 2025b]. Models trained on a narrow distribution of misaligned behavior generalize to egregious behaviors when tested outside the