Rethinking Attention: Polynomial Alternatives to Softmax in Transformers
arXiv cs.LG
arXiv:2410.18613v3 Announce Type: replace
This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the Frobenius norm of the attention matrix, which stabilizes