DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

arXiv:2505.18441v2 (announce type: replace)

Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, even though this subproblem is known to be NP-hard.
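To make the sparse encoding subproblem concrete, the following minimal sketch contrasts two ways of computing a k-sparse code for a fixed dictionary. It is illustrative only and is not the DB-KSVD algorithm. The names `linear_encode` and `omp_encode` are hypothetical, the tied-weight encoder (the transpose of the dictionary followed by top-k selection) is an assumed stand-in for an SAE encoder, and orthogonal matching pursuit is a classical greedy approximation chosen here for comparison, not a method taken from the abstract.

```python
import numpy as np


def linear_encode(x, D, k):
    """SAE-style stand-in: one linear map (assumed here to be D^T), keep the k largest activations."""
    a = D.T @ x
    idx = np.argsort(-np.abs(a))[:k]
    z = np.zeros(D.shape[1])
    z[idx] = a[idx]
    return z


def omp_encode(x, D, k):
    """Orthogonal matching pursuit: greedy approximation to min ||x - D z|| with at most k nonzeros."""
    residual = x.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with what is still unexplained.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break  # residual is (numerically) zero on all unused atoms
        support.append(j)
        # Refit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    z = np.zeros(D.shape[1])
    z[support] = coef
    return z


# Synthetic demo (all sizes are arbitrary assumptions): an overcomplete dictionary
# with unit-norm columns and a signal built from 5 of its atoms.
rng = np.random.default_rng(0)
D = rng.standard_normal((32, 128))
D /= np.linalg.norm(D, axis=0)
z_true = np.zeros(128)
z_true[rng.choice(128, size=5, replace=False)] = rng.standard_normal(5)
x = D @ z_true

for name, encode in [("linear", linear_encode), ("omp", omp_encode)]:
    z = encode(x, D, 5)
    print(name, "reconstruction error:", np.linalg.norm(x - D @ z))
```

When the dictionary is orthonormal, the two encoders return the same code. For an overcomplete dictionary with correlated atoms they can differ, because the single linear map does not refit coefficients after selecting the support.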