RepIt: Steering Language Models with Concept-Specific Refusal Vectors

arXiv:2509.13281v5 Announce Type: replace

Current safety evaluations of language models rely on benchmark-based assessments that may miss localized vulnerabilities. We present RepIt, a simple and data-efficient framework for isolating concept-specific representations in LM activations. While existing steering methods already achieve high attack success rates through broad interventions, RepIt enables a concerning capability: selective suppression of refusal on targeted concepts while preserving refusal elsewhere.