Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

arXiv:2502.03714v2 Announce Type: replace-cross We present Universal Sparse Autoencoders (USAEs), a framework for uncovering and aligning interpretable concepts spanning multiple pretrained deep neural networks. Unlike existing concept-based interpretability methods, which focus on a single model, USAEs jointly learn a universal concept space that can reconstruct and interpret the internal activations of multiple models at once.
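
To make the idea of a shared concept space concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one linear encoder and one linear decoder per model, a top-k sparsity rule, and a summed mean-squared reconstruction error; none of these choices is stated in the abstract. The names `model_a`, `model_b`, `encode`, `decode_all`, and `joint_loss` are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: two models with different activation widths
# and one shared concept space (all values are illustrative assumptions).
model_dims = {"model_a": 8, "model_b": 12}
n_concepts = 16
k = 4  # assumed top-k sparsity: at most k active concepts per sample

# One encoder and one decoder per model, all mapping to/from the same concept space.
encoders = {m: rng.normal(scale=0.1, size=(d, n_concepts)) for m, d in model_dims.items()}
decoders = {m: rng.normal(scale=0.1, size=(n_concepts, d)) for m, d in model_dims.items()}


def encode(acts, source):
    """Map activations of `source` to sparse, non-negative concept codes."""
    pre = np.maximum(acts @ encoders[source], 0.0)  # ReLU pre-codes
    top = np.argpartition(pre, -k, axis=1)[:, -k:]  # indices of the k largest per row
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, top, np.take_along_axis(pre, top, axis=1), axis=1)
    return codes


def decode_all(codes):
    """Reconstruct every model's activations from the same concept codes."""
    return {m: codes @ decoders[m] for m in model_dims}


def joint_loss(batch, source):
    """Sum of per-model mean-squared errors when encoding from one source model."""
    recons = decode_all(encode(batch[source], source))
    return sum(np.mean((recons[m] - batch[m]) ** 2) for m in model_dims)


# Stand-in activations for the same 5 inputs passed through both models.
batch = {m: rng.normal(size=(5, d)) for m, d in model_dims.items()}
codes = encode(batch["model_a"], "model_a")
recons = decode_all(codes)
loss = joint_loss(batch, "model_a")
```

In this sketch the codes computed from one model's activations are decoded into every model's activation space, so a single concept index refers to the same code dimension across models. Minimizing `joint_loss` over the encoder and decoder weights would be one way to train such a system under the stated assumptions.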