The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

arXiv:2603.07461v1 Announce Type: cross

Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5
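
To make the distinction between the mixing strategies concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that "Kronecker mixing" means a mixing matrix of the form M ⊗ I, where M is a small heads-by-heads matrix of scalars and I is the identity over each head's channels; that reading is inferred from the phrase "scalar communication between heads while preserving within-head structure" and is not confirmed by the abstract. The function name `kronecker_head_mix` and all shapes are hypothetical.

```python
import numpy as np


def kronecker_head_mix(x, head_mix):
    """Mix information across heads using one scalar per (output head, input head) pair.

    x:        array of shape (seq_len, n_heads, d_head), per-head channel vectors
    head_mix: array of shape (n_heads, n_heads), scalar mixing weights (assumed form)

    Each output head is a weighted sum of whole input-head vectors, so the
    coordinates inside a head are never mixed with one another.
    """
    return np.einsum("gh,shd->sgd", head_mix, x)


# Illustrative sizes only; they are not taken from the paper.
seq_len, n_heads, d_head = 5, 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, n_heads, d_head))
head_mix = rng.normal(size=(n_heads, n_heads))

mixed = kronecker_head_mix(x, head_mix)

# The same operation written as one large matrix: kron(head_mix, I_d).
dense_equivalent = np.kron(head_mix, np.eye(d_head))
flat = x.reshape(seq_len, n_heads * d_head) @ dense_equivalent.T

# Fully independent heads correspond to head_mix being the identity.
independent = kronecker_head_mix(x, np.eye(n_heads))

# Parameter counts for the mixing step under this assumed form.
dense_params = (n_heads * d_head) ** 2
kronecker_params = n_heads ** 2
```

Under this assumption, a dense mixing matrix would have `(n_heads * d_head) ** 2` free parameters and could blend any channel with any other, whereas the Kronecker form has only `n_heads ** 2` and leaves each head's internal coordinates intact. Setting `head_mix` to the identity recovers the fully independent case.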