Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

arXiv:2601.04448v2 Announce Type: replace-cross Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets, often collected from human or web sources, makes them vulnerable to backdoor attacks, in which adversaries poison a small subset of the data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored.