Bypassing Prompt Injection Detectors through Evasive Injections

arXiv:2602.00750v2 Announce Type: replace-cross Large language models (LLMs) are increasingly used in interactive and retrieval-augmented systems, but they remain vulnerable to prompt injection attacks, in which injected secondary prompts cause the model to deviate from the user's instructions and execute a potentially malicious task defined by the adversary. Recent work shows that ML models trained on activation shifts from LLMs' hidden layers can detect such drift. In this paper, we demonstrate that these detectors are not robust to adaptive adversaries.