AI RESEARCH

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

arXiv cs.LG

arXiv:2605.12697v1 Announce Type: cross Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$ and $(\log n)^2$. We provide a general theory showing that the desirable scale is determined by the gap-counting function $N_n$ of each attention row.
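To make the three candidate laws concrete, the following is a minimal sketch of length-dependent logit rescaling applied to a single attention row. It is not the paper's method: the function names `inverse_temperature` and `attention_row`, the law labels, and the Gaussian logits are all assumptions chosen for illustration, and the gap-counting function $N_n$ is not modeled because it is not defined in this abstract.

```python
import numpy as np


def inverse_temperature(n, law):
    """Return a candidate inverse temperature beta_n for context length n.

    The three laws are the ones named in the abstract; the string labels
    are hypothetical and exist only for this example.
    """
    log_n = np.log(n)
    if law == "sqrt_log":
        return log_n ** 0.5
    if law == "log":
        return log_n
    if law == "log_squared":
        return log_n ** 2
    raise ValueError(f"unknown law: {law}")


def attention_row(scores, beta):
    """Softmax over one row of raw attention logits, scaled by beta."""
    z = beta * np.asarray(scores, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


rng = np.random.default_rng(0)
n = 1024
scores = rng.standard_normal(n)  # illustrative logits, not data from the paper

for law in ("sqrt_log", "log", "log_squared"):
    beta = inverse_temperature(n, law)
    row = attention_row(scores, beta)
    # Report the scale and the largest attention weight in the row.
    print(law, float(beta), float(row.max()))
```

For a fixed set of logits, a larger inverse temperature concentrates more of the row's mass on its largest entry, so the three laws give visibly different sharpness at the same $n$. The sketch says nothing about which law is appropriate; according to the abstract, that depends on $N_n$ for the row in question.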