Attention is all you need: Kimi replaces residual connections with attention

TL;DR: Transformers already use attention to decide which tokens matter. Unlike DeepSeek's mHC, Kimi's paper shows you should also use attention to decide which layers matter: it replaces the decades-old residual connection, which treats every layer equally, with a learned mechanism that lets each layer selectively retrieve what it needs from earlier layers. Results: scaling-law experiments show a consistent 1.25× compute advantage across model sizes. Attention is still all you need, just in a new dimension.

submitted by /u/InternationalAsk1490
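To make the contrast concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names, the single per-layer `query` vector, and the scaled dot-product scoring are all assumptions chosen for illustration. It only shows the general idea of replacing an equal-weight sum over earlier layer outputs with softmax-weighted retrieval.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def residual_mix(layer_outputs):
    # Plain residual stream: every earlier output is added with weight 1.
    dim = len(layer_outputs[0])
    return [sum(h[i] for h in layer_outputs) for i in range(dim)]

def attention_mix(layer_outputs, query):
    # Hypothetical depth-wise attention: score each earlier layer's output
    # against a learned query, then take the softmax-weighted average.
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)
    scores = [scale * sum(q * x for q, x in zip(query, h)) for h in layer_outputs]
    weights = softmax(scores)
    mixed = [sum(w * h[i] for w, h in zip(weights, layer_outputs)) for i in range(dim)]
    return mixed, weights

# Toy data: outputs of three earlier layers, each a 2-dimensional vector.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [4.0, 0.0]  # assumed learned; here it favors the first coordinate

summed = residual_mix(outputs)
mixed, weights = attention_mix(outputs, query)
print("residual sum:", summed)
print("attention weights over layers:", weights)
print("attention mix:", mixed)
```

With an all-zero query the weights become uniform, so the mix reduces to a plain average of earlier layers; a non-zero query lets a layer down-weight earlier outputs it does not need.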