Finding Interpretable Prompt-Specific Circuits in Language Models

arXiv:2602.13483v2 Announce Type: replace Understanding the internal circuits that language models use to solve tasks remains a central challenge in mechanistic interpretability. A crucial part of finding circuits is understanding why each attention head attends where it does. To this end, we