The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

arXiv:2604.21215v1

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed from the previous layer's outputs, so the effective depth is capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and have historically underutilized modern accelerators. We