Control Reinforcement Learning: Interpretable Token-Level Steering of LLMs via Sparse Autoencoder Features

ArXi:2602.10437v3 Announce Type: replace Sparse autoencoders (SAEs) decompose language model activations into interpretable features, but existing methods reveal only which features activate, not which change model outputs when amplified. We