Clustering in pure-attention hardmax transformers and its role in sentiment analysis

arXiv:2407.01602v2 Announce Type: replace-cross

Transformers are extremely successful machine learning models whose mathematical properties remain poorly understood. Here, we rigorously characterize the behavior of transformers with hardmax self-attention and normalization sublayers as the number of layers tends to infinity.