Disentangling MLP Neuron Weights in Vocabulary Space

arXiv:2604.06005v1 Announce Type: new Interpreting the information encoded in model weights remains a fundamental challenge in mechanistic interpretability. In this work, we