Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse

arXiv:2603.18056v1

Extreme neural network sparsification (90% activation reduction) poses a critical challenge for mechanistic interpretability: determining whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder--Sparse Autoencoder (VAE-SAE) architectures. We