AI RESEARCH
Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders
arXiv CS.AI
arXiv:2603.01246v2
Safety alignment in large language models (LLMs), particularly for cybersecurity tasks, focuses primarily on preventing misuse. While this approach reduces direct harm, it obscures a complementary failure mode: denial of assistance to legitimate defenders. We study Defensive Refusal Bias, the tendency of safety-tuned frontier LLMs to refuse assistance with authorized defensive cybersecurity tasks when those tasks use language similar to that of offensive cyber tasks.