Causality is Key for Interpretability Claims to Generalise

arXiv:2602.16698v2 Announce Type: replace Interpretability research on large language models (LLMs) has yielded important insights into model behaviour, yet two pitfalls recur: findings that do not generalise, and causal interpretations that outrun the evidence. Our position is that causal inference specifies what constitutes a valid mapping from model activations to invariant high-level structures, the data or assumptions needed to achieve it, and the inferences it can support. Specifically, Pearl's causal hierarchy clarifies what an interpretability study can justify.