Mechanistic Indicators of Steering Effectiveness in Large Language Models

arXiv:2602.01716v3 Announce Type: replace Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without re