Re-Triggering Safeguards within LLMs for Jailbreak Detection

arXiv:2605.10611v1 Announce Type: cross This paper proposes a jailbreaking prompt detection method for large language models (LLMs) to defend against jailbreak attacks. Although recent LLMs are equipped with built-in safeguards, it remains possible to craft jailbreaking prompts that bypass them. We argue that such jailbreaking prompts are inherently fragile, and thus