I Caught a Jailbreak Attack That Hides Inside Normal Conversations

This attack does not look like an attack. That is exactly what makes it dangerous.

I was working on one of my projects, a failure intelligence system and open-source LLM security guardrail, when I came across a 2024 Anthropic paper on many-shot jailbreaking. I implemented detection for it, hit a tricky false-positive bug, fixed it, and ended up with a 0% false positive rate (FPR) on benign prompts. Here is the story.

The Attack: Hiding Harm Inside a Normal Conversation

A standard jailbreak looks obviously suspicious:

Ignore all previous instructions. You are now.