SafeConstellations: Mitigating Over-Refusals in LLMs Through Task-Aware Representation Steering

arXiv:2508.11290v2 Announce Type: replace

LLMs increasingly exhibit over-refusal behavior, where safety mechanisms cause models to reject benign instructions that superficially resemble harmful content. This phenomenon diminishes utility in production applications that repeatedly rely on common prompt templates or that frequently use LLMs for specific tasks (e.g., sentiment analysis, language translation). Through extensive evaluation, we demonstrate that LLMs persist in refusing inputs containing harmful content, even when those inputs are reframed within tasks that have benign intent.