I Caught a Jailbreak Attack That Hides Inside Normal Conversations

Dev.to AI

This attack does not look like an attack. That is exactly what makes it dangerous.

I was working on one of my projects, a failure intelligence system (an open-source LLM security guardrail), when I came across a 2024 Anthropic paper on many-shot jailbreaking. I implemented detection for it, hit a tricky false-positive bug, fixed it, and ended up with a 0% false positive rate (FPR) on benign prompts. Here is the story.

The Attack: Hiding Harm Inside a Normal Conversation

A standard jailbreak looks obviously suspicious:

Ignore all previous instructions. You are now.