AI RESEARCH

Minimizing Collateral Damage in Activation Steering

arXiv CS.LG

ArXi:2605.01167v1 Announce Type: new Activation steering is a method for controlling Large Language Model (LLM) behavior by intervening in its internal representations to increase the alignment with a specific target feature direction. However, standard interventions, such as vector addition, often cause ``collateral damage", defined as unintended changes in the alignment of activations along other non-target feature directions. This damage occurs because standard methods implicitly assume the isotropy of non-target features.