AI RESEARCH

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

arXiv CS.AI

ArXi:2604.28129v1 Announce Type: cross Multi-turn prompt injection follows a known attack path -- trust-building, pivoting, escalation but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data.