Improving Robustness In Sparse Autoencoders via Masked Regularization

arXiv:2604.06495v1

Sparse autoencoders (SAEs) are widely used in mechanistic interpretability to project LLM activations onto sparse latent spaces. However, sparsity alone is an imperfect proxy for interpretability, and current