SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models

arXiv:2510.20129v2 Announce Type: replace-cross Large Language Models (LLMs) remain vulnerable to jailbreak attacks, where adversarially crafted prompts induce policy-violating responses despite safety alignment. Existing defenses typically improve safety through external filtering, auxiliary guardrails, or decoding-time control. However, these interventions often reduce practical deployability because they may require additional model access,