AI RESEARCH

Persona-Model Collapse in Emergent Misalignment

arXiv cs.AI

arXiv:2605.12850v1 Announce Type: cross Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters.