QUEST: A robust attention formulation using query-modulated spherical attention

arXiv:2604.00199v1 Announce Type: cross The Transformer architecture has become one of the most widely used model architectures in deep learning, and the attention mechanism is at its core. The standard attention formulation applies a softmax operation to a scaled dot product between query and key vectors. We explore the role played by the norms of the queries and keys, which can cause