AI RESEARCH
Attention Sinks Induce Gradient Sinks
arXiv cs.LG
arXiv:2603.17771v1 Announce Type: new
Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a