CachePrune: Teaching LLMs What Not to Follow via KV-Cache Editing

arXiv:2504.21228v3 Announce Type: replace-cross Large Language Models (LLMs) are susceptible to indirect prompt injection attacks, in which the model inadvertently follows instructions injected into the prompt context. This vulnerability stems from LLMs' inability to distinguish between data and instructions within a prompt. We propose CachePrune, which defends against this attack by identifying and pruning neurons associated with instruction-following during KV cache encoding of the prompt context.
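
To make the pruning idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation, and it rests on several assumptions: the abstract does not say how instruction-following neurons are identified, so the per-feature `scores` below are hypothetical inputs; "pruning" is modeled as zeroing the selected features; and the cache is a toy list of per-token feature vectors rather than a real model's key/value tensors. The names `build_prune_mask`, `prune_context_cache`, and `prune_fraction` are illustrative only.

```python
from typing import List


def build_prune_mask(scores: List[float], prune_fraction: float) -> List[bool]:
    """Return a keep-mask: False for the highest-scoring features, True otherwise.

    `scores` is a hypothetical per-feature attribution score, where a higher
    value means the feature is more associated with instruction-following.
    """
    if not 0.0 <= prune_fraction <= 1.0:
        raise ValueError("prune_fraction must be in [0, 1]")
    num_pruned = int(len(scores) * prune_fraction)
    # Rank feature indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    pruned = set(ranked[:num_pruned])
    return [i not in pruned for i in range(len(scores))]


def prune_context_cache(
    cache: List[List[float]], context_len: int, keep_mask: List[bool]
) -> List[List[float]]:
    """Zero the pruned features for context positions only.

    Assumption: the first `context_len` rows of `cache` belong to the untrusted
    prompt context; later rows (e.g. the user's own instruction) are unchanged.
    """
    pruned_cache = []
    for position, row in enumerate(cache):
        if position < context_len:
            pruned_cache.append(
                [value if keep else 0.0 for value, keep in zip(row, keep_mask)]
            )
        else:
            pruned_cache.append(list(row))
    return pruned_cache


# Toy example: 3 cached positions with 4 features each; the first 2 are context.
scores = [0.1, 0.9, 0.3, 0.7]
mask = build_prune_mask(scores, prune_fraction=0.5)
cache = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
pruned = prune_context_cache(cache, context_len=2, keep_mask=mask)
```

In this toy run, features 1 and 3 carry the highest scores, so they are zeroed in the two context rows while the final row is left intact. Restricting the mask to context positions reflects the abstract's statement that pruning happens during KV cache encoding of the prompt context.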