Homogenized Transformers

arXiv:2604.01978v1 Announce Type: cross

We study a random model of deep multi-head self-attention in which the weights are resampled independently across layers and heads, as at initialization of