From Refusal Tokens to Refusal Control: Discovering and Steering Category-Specific Refusal Directions

arXiv:2603.13359v1 Announce Type: new Language models are commonly fine-tuned for safety alignment to refuse harmful prompts. One approach fine-tunes them to generate categorical refusal tokens that distinguish different refusal types before responding. In this work, we leverage a version of Llama 3 8B fine-tuned with these categorical refusal tokens to enable inference-time control over fine-grained refusal behavior, improving both safety and reliability.