AI RESEARCH

Steered LLM Activations are Non-Surjective

arXiv CS.AI

ArXi:2604.09839v1 Announce Type: new Activation steering is a popular white-box control technique that modifies model activations to elicit an abstract change in output behavior. It has also become a standard tool in interpretability (e.g., probing truthfulness, or translating activations into human-readable explanations and safety research (e.g., studying jailbreakability). However, it is unclear whether steered activation states are realizable by any textual prompt.