AI RESEARCH

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

arXiv cs.LG

arXiv:2605.12874v1 Announce Type: new

Sparse autoencoders (SAEs) are now standard tools for decomposing language model activations into interpretable features, and automated interpretability pipelines routinely assign each feature a short natural-language explanation. Existing critiques of this practice focus either on polysemanticity, where one feature has many meanings, or on whether explanations predict activations. We identify a complementary, structurally distinct problem that we call descriptive collision: many distinct SAE features admit the same explanation.
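
To make the idea concrete, here is a minimal sketch of one way to detect collisions in a set of auto-generated explanations. It rests on assumptions that are not taken from the paper: a collision is treated as an exact match after lowercasing and whitespace normalization, and the names `normalize`, `find_collisions`, and the toy `explanations` mapping are hypothetical. The paper's own criterion for two features "admitting the same explanation" may well be broader (for example, semantic rather than string equality).

```python
from collections import defaultdict


def normalize(explanation: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(explanation.lower().split())


def find_collisions(explanations: dict[int, str]) -> dict[str, list[int]]:
    """Group feature ids by normalized explanation.

    Returns only the explanations shared by more than one feature.
    Assumption: exact string match after normalization counts as a collision.
    """
    groups: dict[str, list[int]] = defaultdict(list)
    for feature_id, text in explanations.items():
        groups[normalize(text)].append(feature_id)
    return {text: sorted(ids) for text, ids in groups.items() if len(ids) > 1}


# Toy data (hypothetical): feature id -> auto-interp explanation.
explanations = {
    0: "Mentions of days of the week",
    1: "mentions of days of the  week",
    2: "Python import statements",
    3: "Mentions of days of the week",
}

collisions = find_collisions(explanations)
print(collisions)
```

On this toy input, features 0, 1, and 3 collapse onto a single explanation while feature 2 stays unique. Exact matching only gives a lower bound on collisions, since paraphrased explanations would be missed.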