Single-Position Intervention Fails: Distributed Output Templates Drive In-Context Learning

arXiv:2605.04061v1 Announce Type: new Understanding how large language models encode task identity from few-shot demonstrations is a central open problem in mechanistic interpretability. Prior work uses linear probing to localize task representations, reporting high classification accuracy at specific layers. We reveal a striking dissociation: probing accuracy completely fails to predict causal importance. Single-position activation intervention achieves 0% task transfer across all 28 layers of Llama-3.2-3B, despite 100% probing accuracy at those same positions.
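
The dissociation between probing accuracy and causal importance can be illustrated with a deliberately small toy sketch. It is not the paper's model, data, or code. It assumes a hypothetical setup in which every position carries the task label as a +1/-1 signal and the downstream readout depends on the sum over all positions. All function names (`make_activations`, `probe`, `readout`, `patch`) are invented for illustration.

```python
# Toy illustration only: a hypothetical 4-position "model", not Llama-3.2-3B.

N_POSITIONS = 4


def make_activations(task):
    """Assumed encoding: every position stores +1.0 for task A, -1.0 for task B."""
    sign = 1.0 if task == "A" else -1.0
    return [sign] * N_POSITIONS


def probe(activation):
    """Stand-in for a linear probe on one position: decode the task from the sign."""
    return "A" if activation > 0 else "B"


def readout(activations):
    """Assumed downstream computation: the output task follows the sum over positions."""
    return "A" if sum(activations) > 0 else "B"


def patch(target, source, positions):
    """Activation intervention: copy source activations into the target run."""
    patched = list(target)
    for p in positions:
        patched[p] = source[p]
    return patched


source = make_activations("A")  # run whose task we try to transfer
target = make_activations("B")  # run we intervene on

# Probing: fraction of (run, position) pairs where the probe recovers the task.
probe_hits = [probe(a) == "A" for a in source] + [probe(a) == "B" for a in target]
probe_accuracy = sum(probe_hits) / len(probe_hits)

# Single-position intervention: patch one position at a time, check for task transfer.
single_hits = [readout(patch(target, source, [p])) == "A" for p in range(N_POSITIONS)]
single_position_transfer = sum(single_hits) / len(single_hits)

# Distributed intervention: patch every position at once.
all_position_transfer = readout(patch(target, source, range(N_POSITIONS))) == "A"

print(probe_accuracy)            # 1.0
print(single_position_transfer)  # 0.0
print(all_position_transfer)     # True
```

In this toy, the task is perfectly decodable at every position, yet overwriting any single position never changes the output, because the readout aggregates over all positions. Whether the real model behaves this way for the same reason is exactly what the paper investigates; the sketch only shows that such a dissociation is possible.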