Differentiable Faithfulness Alignment for Cross-Model Circuit Transfer

arXiv:2604.24302v1 Announce Type: new Mechanistic interpretability has made it possible to localize circuits underlying specific behaviors in language models, but existing methods are expensive, model-specific, and difficult to scale to larger architectures. We