Exact Linear Attention

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention by leveraging the exact decomposition property of kernel functions, without any approximation error. It identifies and addresses gradient explosion and token attention dilution in prior linear attention methods by imposing kernel constraints that ensure non-negativity, discriminability, and geometric interpretability. Several kernel functions are proposed, inclu