Rethinking Attention: Polynomial Alternatives to Softmax in Transformers

arXiv:2410.18613v3 Announce Type: replace

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes
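
The Frobenius-norm claim can be checked numerically with a minimal sketch. It is not the paper's code and relies only on a standard fact: each row of a softmax attention matrix is a probability vector, so its squared entries sum to at most 1, and an n x n attention matrix therefore has Frobenius norm between 1 and sqrt(n), however large the raw scores are. The helper names (`softmax_attention`, `frobenius_norm`, `random_scores`) and the Gaussian test scores are assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_attention(scores):
    """Apply softmax row-wise, as in standard attention."""
    return [softmax(row) for row in scores]


def frobenius_norm(matrix):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in matrix for v in row))


def random_scores(n, scale, seed=0):
    """Hypothetical n x n score matrix with Gaussian entries (illustration only)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    n = 64
    for scale in (1.0, 10.0, 100.0):
        scores = random_scores(n, scale)
        attn = softmax_attention(scores)
        # The raw score norm grows with `scale`; the attention norm stays <= sqrt(n).
        print(scale, frobenius_norm(scores), frobenius_norm(attn), math.sqrt(n))
```

Running the script shows the norm of the raw scores growing with `scale` while the norm of the softmax output stays within the interval [1, sqrt(n)], which is the kind of norm control the abstract attributes to softmax.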