Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

arXiv:2605.08277v1 Announce Type: cross Many-shot jailbreaking (MSJ) causes safety-aligned language models to answer harmful queries by preceding them with many harmful question-answer demonstrations. We study why this attack becomes stronger as the number of demonstrations increases. Empirically, we find that MSJ induces a progressive activation drift: the representation of a fixed harmful query moves step by step away from the safety-aligned region as harmful demonstrations are added.
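
One way to picture the activation drift described above is as a distance curve: for each number of prepended demonstrations, measure how far the fixed query's representation sits from a reference point for the safety-aligned region. The sketch below is only an illustration under assumptions that are not taken from the paper: the reference point is taken to be the centroid of some "safe" representations, the distance is cosine distance, and the vectors are hand-written toy values rather than real model activations. The names `drift_curve`, `safe_reps`, and `query_reps_by_shots` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def drift_curve(query_reps_by_shots, safe_reps):
    """Distance of the query representation from the safe centroid,
    one value per demonstration count (index 0 = no demonstrations)."""
    anchor = centroid(safe_reps)
    return [cosine_distance(rep, anchor) for rep in query_reps_by_shots]


# Toy 2-D stand-ins for hidden states (assumed values, not measurements).
safe_reps = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]

# Representation of the same query after 0, 1, 2, 3 prepended demonstrations.
query_reps_by_shots = [[1.0, 0.05], [0.8, 0.4], [0.5, 0.7], [0.2, 0.9]]

curve = drift_curve(query_reps_by_shots, safe_reps)
for shots, distance in enumerate(curve):
    print(f"{shots} demonstrations: distance from safe centroid = {distance:.4f}")
```

With real activations, the rows of `query_reps_by_shots` would come from a chosen layer of the model at the query position; which layer and which distance best expose the drift is an open choice in this sketch.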