IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

ArXi:2603.10521v1 Announce Type: new Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can dataset, to address these difficulties.