Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

arXiv:2604.11061v1 Announce Type: cross Mechanistic interpretability is often motivated by alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control for whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder.