Engineering Verifiable Modularity in Transformers via Per-Layer Supervision

arXiv:2603.18029v1 Announce Type: cross Transformers resist surgical control. Ablating an attention head identified as critical for capitalization produces minimal behavioral change because distributed redundancy compensates for damage. This Hydra effect renders interpretability illusory: we may identify components through correlation, but cannot predict or control their causal role. We demonstrate that architectural interventions can expose hidden modularity.