Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

arXiv:2603.18353v1 Announce Type: new Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested.