Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows

arXiv cs.LG

arXiv:2605.18870v1 Announce Type: new

In recent years, transformer architectures have revolutionized the field of language processing, opening the door to previously unforeseen possibilities. From a theoretical point of view, however, the mathematical models proposed in the literature often lack direct contact with the actual architectures and depend on strong simplifying assumptions.