Exemplar Partitioning for Mechanistic Interpretability

arXiv:2605.14347v1 Announce Type: new We characterise Exemplar Partitioning (EP) as an interpretability object via targeted demonstrations of properties newly accessible through this construction, plus one head-to-head benchmark. In Gemma-2-2B, EP dictionary regions are interpretable and support causal interventions: refusal in instruction-tuned Gemma concentrates in a region whose exemplar ablation can collapse held-out refusal. Cross-checkpoint matching between base and instruction-tuned dictionaries separates the directions preserved through finetuning from those.