CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

arXiv:2603.21014v1 Announce Type: new Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders make it possible to represent model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice.