Hidden Heroes and Gradient Bloats: Layer-Wise Redundancy Inverts Attribution in Transformers

arXiv:2602.01442v3 Announce Type: replace-cross Gradient-based attribution is the workhorse of mechanistic interpretability, yet whether it reliably tracks causal importance at the component level remains largely untested. We causally evaluate this assumption across two algorithmic tasks and up to 10 random seeds, uncovering a systematic, layer-wise failure: gradient attribution consistently overvalues early-layer \textbf{Gradient Bloats} and undervalues late-layer \textbf{Hidden Heroes}