Laplacian Heads Improve Transformers by Smoothing Token Representations

arXiv:2602.09297v2 Announce Type: replace Transformers update token representations through multi-head attention and residual connections as $X \leftarrow X + \sum_{i} P^{(i)}XW_{V_i}W_{o_i}$, where $P^{(i)}$ is the softmax attention matrix of head $i$. We propose replacing a subset of the $P^{(i)}$ matrices with the Laplacian $I - P^{(i)}$, giving $X \leftarrow X + \sum_{i \in \mathcal{A}} P^{(i)}XW_{V_i}W_{o_i} + \sum_{i \in \mathcal{L}} (I - P^{(i)})XW_{V_i}W_{o_i}$, where $\mathcal{A}$ and $\mathcal{L}$ index the standard attention heads and the Laplacian heads, respectively. Our proposal has two motivations.
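
The update rule above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes that $P^{(i)}$ comes from unmasked scaled dot-product attention, it omits layer normalization and any other block components, and the names `mixed_head_update`, `heads`, and `laplacian_heads` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mixed_head_update(X, heads, laplacian_heads):
    """One residual update with a mix of attention and Laplacian heads.

    X: (n, d) token representations.
    heads: list of dicts holding W_Q, W_K (d, d_k), W_V (d, d_v), W_O (d_v, d).
    laplacian_heads: set of head indices that use I - P instead of P.
    """
    n = X.shape[0]
    out = X.copy()
    for i, h in enumerate(heads):
        d_k = h["W_K"].shape[1]
        # Assumption: P is unmasked scaled dot-product attention.
        scores = (X @ h["W_Q"]) @ (X @ h["W_K"]).T / np.sqrt(d_k)
        P = softmax(scores, axis=-1)  # each row sums to 1
        # Laplacian heads mix tokens with I - P; the others use P.
        M = np.eye(n) - P if i in laplacian_heads else P
        out = out + M @ X @ h["W_V"] @ h["W_O"]
    return out

# Toy usage with random weights: 4 heads, head 1 and head 3 are Laplacian.
rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4
heads = [
    {
        "W_Q": rng.normal(size=(d, d_k)),
        "W_K": rng.normal(size=(d, d_k)),
        "W_V": rng.normal(size=(d, d_k)),
        "W_O": rng.normal(size=(d_k, d)),
    }
    for _ in range(4)
]
X = rng.normal(size=(n, d))
X_new = mixed_head_update(X, heads, laplacian_heads={1, 3})
```

Because each row of $P^{(i)}$ sums to one, a Laplacian head outputs, for every token, the difference between that token's value and its attention-weighted average, so it contributes nothing when all tokens are identical.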