From Data Statistics to Feature Geometry: How Correlations Shape Superposition

arXiv:2603.09972v1 Announce Type: cross A central idea in mechanistic interpretability is that neural networks represent more features than they have dimensions, arranging them in superposition to form an over-complete basis. This framing has been influential, motivating dictionary learning approaches such as sparse autoencoders. However, superposition has mostly been studied in idealized settings where features are sparse and uncorrelated. In these settings, superposition is typically understood as