AI RESEARCH

Residual Stream Duality in Modern Transformer Architectures

arXiv CS.AI

arXiv:2603.16039v1 Announce Type: cross

Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis.
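
The contrast between the two axes can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the identity query/key/value projections, and the toy layer functions are all assumptions made for illustration. Along the sequence axis, causal attention weights are computed from the inputs, so they change when the tokens change. Along the depth axis, a standard residual stack adds every layer's update with a fixed coefficient of 1.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention_weights(tokens):
    # Sequence axis: position t mixes positions 0..t with weights that
    # depend on the token contents (identity Q/K projections are an
    # assumption made to keep the sketch short).
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([dot(tokens[t], tokens[s]) / scale for s in range(t + 1)])
        for t in range(len(tokens))
    ]


def residual_stack(x, layers):
    # Depth axis: h_{l+1} = h_l + f_l(h_l). Each update enters the stream
    # with a fixed coefficient of 1, whatever the input is.
    h = list(x)
    updates = []
    for f in layers:
        u = f(h)
        updates.append(u)
        h = [a + b for a, b in zip(h, u)]
    return h, updates


# Two toy sequences that differ only in the second token.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[1.0, 0.0], [3.0, 0.0], [1.0, 1.0]]
weights_a = causal_attention_weights(tokens_a)
weights_b = causal_attention_weights(tokens_b)

# Hypothetical toy layers standing in for attention/MLP blocks.
toy_layers = [
    lambda h: [0.5 * v for v in h],
    lambda h: [v - 1.0 for v in h],
]
x0 = [1.0, 2.0]
h_final, updates = residual_stack(x0, toy_layers)

# The final state is the input plus an unweighted sum of all updates.
unweighted_sum = [
    x0[i] + sum(u[i] for u in updates) for i in range(len(x0))
]
```

In this sketch, `weights_a[2]` and `weights_b[2]` differ because the mixing along the sequence is input-dependent, while `h_final` always equals `unweighted_sum` because the mixing along depth is a plain sum.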