Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

arXiv:2605.01899v1 Announce Type: new

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has focused primarily on iterating attacks, while the defense side still lacks systemic and mechanistic constraints.