Why Softmax Attention Outperforms Linear Attention

ArXi:2310.11685v2 Announce Type: replace-cross Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of softmax function. Conversely, linear attention presents a computationally efficient alternative by approximating the softmax operation with linear complexity.