What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal

arXiv:2604.08524v1 Announce Type: cross
Applying steering vectors to large language models (LLMs) is an efficient and effective model alignment technique, but we lack an interpretable explanation for how it works: specifically, what internal mechanisms steering vectors affect and how this results in different model outputs. To investigate the causal mechanisms underlying the effectiveness of steering vectors, we conduct a comprehensive case study on refusal.