Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

arXiv:2601.05770v2

Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformers is hindered by representation entanglement (e.g., superposition), where features encoded in overlapping directions obstruct the recovery of symbolic expressions.