Attention Sinks Induce Gradient Sinks

arXiv:2603.17771v1 Announce Type: new

Attention sinks and massive activations are recurring and closely related phenomena in Transformer models. Existing studies have largely focused on the forward pass, making it unclear whether their connection is direct or mediated by a