Addressing divergent representations from causal interventions on neural networks

arXiv:2511.04638v5 Announce Type: replace A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful the resulting explanations are to the target model in its natural state.