The Effect of Attention Head Count on Transformer Approximation

arXiv:2510.06662v2 Announce Type: replace The transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence its expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the