Transformers with Selective Access to Early Representations [R]

Hello everyone. I’m excited to share our new paper!

Figure 1: Comparison Across Architectures

A lot of recent Transformer variants try to improve information flow across depth by exposing later layers to earlier representations. You may have recently heard about methods like DenseFormer, MUDDFormer, and HyperConnections, which add dense or dynamic cross-layer pathways. These are expressive, but they can also come with meaningful throughput and memory costs.