The Structural Origin of Attention Sink: Variance Discrepancy, Super Neurons, and Dimension Disparity

arXiv:2605.06611v1. Despite the prevalence of the attention sink phenomenon in Large Language Models (LLMs), in which initial tokens monopolize a disproportionate share of attention scores, its structural origins remain elusive. This work provides a mechanistic explanation for the phenomenon. First, we trace its root to the value aggregation process inherent in self-attention, which induces a systematic variance discrepancy. We further demonstrate that this discrepancy is drastically amplified by the activation of super neurons within Feed-Forward Network (FFN) layers.
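
As a rough illustration of how value aggregation could produce a position-dependent variance gap, the toy sketch below averages i.i.d. Gaussian value vectors under a causal mask. The uniform attention weights, the Gaussian values, and the function name `causal_uniform_attention_variance` are all assumptions made for this example; the abstract does not state the paper's actual derivation, so this should be read as one plausible intuition rather than the authors' method.

```python
import numpy as np


def causal_uniform_attention_variance(seq_len=64, dim=4096, seed=0):
    """Per-position variance of attention outputs in a toy causal setting.

    Assumptions (not taken from the paper): value vectors are i.i.d.
    standard normal, and each position attends uniformly to itself and
    to all earlier positions.
    """
    rng = np.random.default_rng(seed)
    values = rng.standard_normal((seq_len, dim))

    # Lower-triangular mask: position t may attend to positions 0..t.
    weights = np.tril(np.ones((seq_len, seq_len)))
    # Normalize each row so the attention weights sum to 1.
    weights /= weights.sum(axis=1, keepdims=True)

    # Value aggregation: each output is a weighted average of value vectors.
    outputs = weights @ values

    # Variance across the hidden dimension, one number per position.
    return outputs.var(axis=1)


variances = causal_uniform_attention_variance()
print("first token variance:", variances[0])
print("last token variance: ", variances[-1])
```

In this toy setting the first token can only attend to itself, so its output keeps the full variance of a single value vector, while position t averages t + 1 independent vectors and its variance shrinks roughly as 1 / (t + 1). Real attention weights are far from uniform, so the size of the gap in an actual model would differ.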