Kimi just published a paper replacing residual connections in transformers. Results look legit


Kimi (Moonshot AI) dropped a paper on something called "attention residuals" that replaces the standard residual connection, which every transformer has used and which goes back to ResNet in 2015. The tldr: normal residual connections just add up everything from all previous layers with equal weight. Layer 40 gets the accumulated output of layers 1-39 all piled up, so the deeper you go, the more diluted earlier information gets. Kimi calls this the "dilution problem." Their fix is to let each layer selectively attend to the outputs of all previous layers instead of just taking the sum.
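To make the difference concrete, here's a toy sketch in plain Python. This is not the paper's actual formulation, just my reading of the idea: `residual_sum` is the usual equal-weight sum, and `attention_mix` is a hypothetical stand-in that scores each earlier layer's output against a query vector and takes a softmax-weighted average. The function names, the query vector, and the scaled dot-product scoring are all my assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_sum(layer_outputs):
    """Standard residual stream: every earlier output is added with weight 1."""
    dim = len(layer_outputs[0])
    return [sum(out[i] for out in layer_outputs) for i in range(dim)]


def attention_mix(layer_outputs, query):
    """Hypothetical selective mix (an assumption, not the paper's method):
    score each earlier output against `query`, then take a softmax-weighted
    average of the outputs instead of a plain sum."""
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, out)) / math.sqrt(dim)
        for out in layer_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * out[i] for w, out in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return mixed, weights


# Toy setup: layer 1 wrote along the first axis, layers 2-4 along the second.
outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]  # made-up query that "wants" what layer 1 produced

summed = residual_sum(outputs)
mixed, weights = attention_mix(outputs, query)

print("plain sum:     ", summed)
print("layer weights: ", [round(w, 3) for w in weights])
print("attention mix: ", [round(x, 3) for x in mixed])
```

In the plain sum, layer 1's contribution is outnumbered by the three later layers, and that only gets worse as depth grows. In the weighted version, the query can put most of the weight back on layer 1 if that's what the current layer needs. In a real model the query would presumably be learned rather than hand-picked like it is here.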