Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

arXiv:2604.13694v1 Announce Type: new Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters.