Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

arXiv:2605.02234v1 Announce Type: cross We present a method for diagnosing interpretations of neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions.
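
To make the idea of per-subspace faithfulness concrete, here is a minimal toy sketch of scoring interchange interventions separately on different buckets of inputs. It is not the paper's method: the abstract does not say how the subspace is identified, so the bucket here is hand-picked. All names (`high_level`, `low_hidden`, `low_output`, `iia_by_bucket`), the toy task, and the even/odd bucketing rule are assumptions made purely for illustration.

```python
from itertools import product

# Hypothetical high-level causal model: S = a + b, output = S + c.
def high_level(a, b, c, s_override=None):
    s = a + b if s_override is None else s_override
    return s + c

# Hypothetical low-level "network". It stores a + b in hidden unit 0 only
# when a is even; for odd a it uses a different internal decomposition.
def low_hidden(a, b, c):
    if a % 2 == 0:
        return [a + b, c]
    return [a, b + c]

def low_output(hidden):
    return hidden[0] + hidden[1]

def interchange_agrees(base, source):
    """Patch hidden unit 0 from the source run into the base run, then
    compare with the high-level model under the matching intervention on S."""
    hidden = low_hidden(*base)
    hidden[0] = low_hidden(*source)[0]
    low = low_output(hidden)
    high = high_level(*base, s_override=source[0] + source[1])
    return low == high

def iia_by_bucket(inputs, bucket_of):
    """Interchange-intervention accuracy for each bucket of (base, source) pairs."""
    totals, hits = {}, {}
    for base, source in product(inputs, repeat=2):
        key = bucket_of(base, source)
        totals[key] = totals.get(key, 0) + 1
        hits[key] = hits.get(key, 0) + int(interchange_agrees(base, source))
    return {key: hits[key] / totals[key] for key in totals}

inputs = list(product(range(4), repeat=3))
# Assumed bucketing rule: both base and source have an even first input.
scores = iia_by_bucket(
    inputs, lambda base, source: base[0] % 2 == 0 and source[0] % 2 == 0
)
print(scores)
```

In this toy setup the hypothesis "hidden unit 0 encodes S" holds exactly inside the even-`a` bucket and fails on most pairs outside it, which is the kind of gap that an aggregate accuracy over all inputs would blur.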