AI RESEARCH

Safety Is Not Universal: The Selective Safety Trap in LLM Alignment

arXiv cs.AI

arXiv:2601.04389v2 Announce Type: replace-cross

Current safety evaluations of large language models (LLMs) create a dangerous illusion of universal protection by aggregating harms under generic categories such as "Identity Hate", which obscures vulnerabilities affecting specific populations. In this work, we expose the Selective Safety Trap: a systemic failure mode in which models robustly defend some populations while leaving underrepresented communities highly vulnerable to identical adversarial attacks. To systematically audit this phenomenon, we.