AI RESEARCH
The Rate-Distortion-Polysemanticity Tradeoff in SAEs
arXiv cs.LG
arXiv:2605.14694v1
Sparse Autoencoders (SAEs) that accurately reconstruct their input (minimizing distortion) while making efficient use of few features (minimizing the rate) often fail to learn monosemantic, highly interpretable representations, limiting their usefulness for mechanistic interpretability. In this paper, we characterise this tension in learning faithful, efficient, and interpretable explanations,