Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

arXiv:2505.20309v3 Announce Type: replace-cross Controlling undesirable Large Language Model (LLM) behaviors, such as generating unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering offers an alternative that applies control at inference time, but existing methods typically lack fine-grained, adaptive mechanisms. We