Transformers with Selective Access to Early Representations

arXiv:2605.03953v1 Announce Type: new Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest of these methods add static value residuals: learned mixing coefficients that expose the first-layer value projection V_1 uniformly across tokens and heads. More expressive dense or dynamic alternatives recover finer-grained access, but at higher memory cost and lower throughput.
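
As a rough illustration of the static value residual idea described above, the sketch below blends a later layer's value projection with V_1 using two scalars shared by every token and head. The abstract does not give the exact mixing formula, so the form alpha * V_l + beta * V_1 is an assumption, as are the hypothetical names mix_static_value_residual, alpha, and beta. In a real model the scalars would be learned parameters.

```python
def mix_static_value_residual(v_layer, v_first, alpha, beta):
    """Blend a layer's values with first-layer values using two scalars.

    v_layer, v_first: nested lists shaped [heads][tokens][head_dim].
    alpha, beta: scalars (learned in practice; fixed here for illustration)
    applied identically to every token and head. The mixing form
    alpha * V_l + beta * V_1 is an assumption, not taken from the paper.
    """
    return [
        [
            [alpha * x + beta * y for x, y in zip(tok_l, tok_1)]
            for tok_l, tok_1 in zip(head_l, head_1)
        ]
        for head_l, head_1 in zip(v_layer, v_first)
    ]


# Toy tensors: 2 heads, 2 tokens, head_dim 2.
v_first = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [4.0, 0.0]]]
v_layer = [[[0.0, 2.0], [2.0, 0.0]], [[0.0, 0.0], [0.0, 4.0]]]

# The same pair of coefficients is used for all tokens and heads.
mixed = mix_static_value_residual(v_layer, v_first, alpha=0.5, beta=0.5)
```

Because the coefficients do not depend on the token or the head, the only extra state is a pair of scalars per layer plus a cached V_1, which is consistent with the abstract calling this the cheapest option. A dense or dynamic variant would instead make the weights depend on the head, the token, or the input.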