AI RESEARCH

CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

arXiv cs.LG

arXiv:2508.12535v3. Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract relevant features, thereby reducing spurious correlations.
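The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that per-token SAE activations are max-pooled into one vector per generated sample and that "correlating" means the Pearson correlation with a 0/1 correctness label; the abstract specifies neither. The names `pearson`, `pool_max`, and `select_features` and the toy data are hypothetical.

```python
from math import sqrt
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pool_max(token_acts: Sequence[Sequence[float]]) -> List[float]:
    """Max-pool a (tokens x features) activation grid into one vector.

    Assumption: max-pooling over generated tokens; other poolings are possible.
    """
    return [max(column) for column in zip(*token_acts)]


def select_features(
    samples: Sequence[Sequence[Sequence[float]]],
    correctness: Sequence[float],
    top_k: int = 1,
) -> Tuple[List[int], List[float]]:
    """Rank SAE features by correlation between pooled activation and correctness."""
    pooled = [pool_max(sample) for sample in samples]
    n_features = len(pooled[0])
    scores = [
        pearson([vec[j] for vec in pooled], correctness)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:top_k], scores


# Toy data (hypothetical): 4 generated samples, 2 tokens each, 3 SAE features.
samples = [
    [[0.9, 0.1, 0.0], [0.7, 0.2, 0.0]],  # answered correctly
    [[0.8, 0.0, 0.3], [0.6, 0.1, 0.1]],  # answered correctly
    [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2]],  # answered incorrectly
    [[0.2, 0.0, 0.1], [0.1, 0.3, 0.0]],  # answered incorrectly
]
correctness = [1.0, 1.0, 0.0, 0.0]

selected, scores = select_features(samples, correctness, top_k=1)
print(selected, [round(s, 3) for s in scores])
```

In this toy example, feature 0 fires mostly on the correctly answered samples, so it receives the highest score and is the one selected. Only the pooled vectors and the correctness labels are needed, which matches the abstract's point about avoiding contrastive datasets and large activation storage.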