Refusal in open-weights models looks like a sparse gate -> amplifier circuit, and generalizes across 12 models from 6 labs (2B-72B)

r/LocalLLaMA

I've been trying to understand where refusal actually lives and how it works mechanistically. Arditi et al. showed that refusal can be steered with a single direction. The question I looked at here is what circuit creates and amplifies that direction.

Main result: across 12 models from 6 labs, I keep finding a sparse gate-amplifier pattern. A mid-layer 'gate' attention head reads a detection-layer representation and writes a routing vector. Later 'amplifier' attention heads then boost that signal toward refusal / censorship behavior.
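To make the gate -> amplifier picture concrete, here is a toy NumPy sketch. It is not the analysis code and not a real transformer: the three directions (`detect_dir`, `route_dir`, `refuse_dir`), the stream width, the head count, and the gain are all made-up assumptions chosen only to illustrate the shape of the circuit and what ablating the gate would do.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # toy residual-stream width (assumption, not from any real model)

# Three hypothetical, mutually orthogonal unit directions in the residual stream.
q, _ = np.linalg.qr(rng.normal(size=(D, 3)))
detect_dir, route_dir, refuse_dir = q[:, 0], q[:, 1], q[:, 2]


def gate_head(resid, enabled=True):
    """Toy 'gate': reads the detection feature and writes a routing vector."""
    if not enabled:
        return np.zeros_like(resid)  # simulates ablating the gate head
    strength = max(0.0, float(resid @ detect_dir))  # only fires on positive detection
    return strength * route_dir


def amplifier_heads(resid, n_heads=4, gain=2.0):
    """Toy 'amplifiers': read the routing vector and write toward refusal."""
    strength = float(resid @ route_dir)
    return n_heads * gain * strength * refuse_dir


def refusal_score(detect_strength, gate_enabled=True):
    """Projection of the final residual stream onto the refusal direction."""
    resid = detect_strength * detect_dir           # detection-layer representation
    resid = resid + gate_head(resid, gate_enabled)  # mid-layer gate writes routing
    resid = resid + amplifier_heads(resid)          # later heads boost the signal
    return float(resid @ refuse_dir)


if __name__ == "__main__":
    print("gate on :", round(refusal_score(1.0), 6))
    print("gate off:", round(refusal_score(1.0, gate_enabled=False), 6))
```

In this toy setup, the refusal projection comes out at about 8 with the gate on (4 heads times a gain of 2) and about 0 with the gate ablated, even though the detection feature is still present in the stream. That is the sense in which a single sparse gate can control a much larger downstream signal.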