AI RESEARCH
Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations
arXiv CS.AI
arXiv:2603.18353v1 Announce Type: new Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested.