Interpretable and Steerable Concept Bottleneck Sparse Autoencoders

arXiv:2512.10805v2 Announce Type: replace-cross

Sparse autoencoders (SAEs) promise a unified approach to mechanistic interpretability, concept discovery, and model steering in LLMs and LVLMs. However, realizing this potential requires learned features to be both interpretable and steerable. To that end, we