Choosing the right basis for interpretability: Psychophysical comparison between neuron-based and dictionary-based representations

arXiv:2411.03993v2 Announce Type: replace

Interpretability research often adopts a neuron-centric lens, treating individual neurons as the fundamental units of explanation. However, neuron-level explanations can be undermined by superposition, in which a single unit responds to a mixture of unrelated patterns. Dictionary learning methods, such as sparse autoencoders and non-negative matrix factorization, offer a promising alternative by learning a new basis over layer activations.
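
To make "learning a new basis over layer activations" concrete, here is a minimal sketch of non-negative matrix factorization on synthetic data. It is an illustration under stated assumptions, not the paper's method or code: the activations are toy values generated here, the names `true_dictionary`, `codes`, `activations`, and `nmf` are hypothetical, and plain-Python multiplicative updates (Lee and Seung style) stand in for a library implementation. Neuron 1 in the toy data responds to both underlying concepts, which mimics superposition at the single-neuron level.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def squared_error(A, U, D):
    """Squared Frobenius norm of A - U D."""
    R = matmul(U, D)
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))

def scale(M, num, den, eps):
    """Elementwise multiplicative update: M * num / (den + eps)."""
    return [[m * p / (q + eps) for m, p, q in zip(rm, rp, rq)]
            for rm, rp, rq in zip(M, num, den)]

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Factor A (samples x neurons) as U D with U >= 0 and D >= 0.

    The rows of D are the learned dictionary directions over neurons;
    the rows of U are the per-sample codes in that new basis.
    """
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    U = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    D = [[rng.random() + 0.1 for _ in range(d)] for _ in range(k)]
    errors = [squared_error(A, U, D)]
    for _ in range(iters):
        Dt = transpose(D)
        U = scale(U, matmul(A, Dt), matmul(U, matmul(D, Dt)), eps)
        Ut = transpose(U)
        D = scale(D, matmul(Ut, A), matmul(matmul(Ut, U), D), eps)
        errors.append(squared_error(A, U, D))
    return U, D, errors

# Toy setup (assumption): 2 concepts spread over 4 neurons.
# Neuron 1 has weight 0.5 on both concepts, so it is "polysemantic".
true_dictionary = [[1.0, 0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0, 0.2]]
rng = random.Random(1)
# Sparse non-negative codes: each concept is active about half the time.
codes = [[rng.random() if rng.random() < 0.5 else 0.0 for _ in range(2)]
         for _ in range(30)]
activations = matmul(codes, true_dictionary)

U, D, errors = nmf(activations, k=2)
print("reconstruction error before:", errors[0], "after:", errors[-1])
for i, direction in enumerate(D):
    print("dictionary direction", i, [round(v, 3) for v in direction])
```

In this sketch, inspecting a row of `D` plays the role of inspecting a dictionary element, whereas inspecting a single column of `activations` plays the role of inspecting a single neuron. The learned directions are only determined up to scaling and ordering, so they need not match `true_dictionary` entry for entry.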