AI RESEARCH
Stronger Enforcement of Instruction Hierarchy via Augmented Intermediate Representations
arXiv CS.LG
arXiv:2505.18907v2

Prompt injection attacks are a critical security vulnerability in large language models (LLMs), allowing attackers to hijack model behavior by injecting malicious instructions within the input context. Recent defense mechanisms have leveraged an Instruction Hierarchy (IH) Signal, often implemented through special delimiter tokens or additive embeddings, to denote the privilege level of input tokens.