Transformers Learn Latent Mixture Models In-Context via Mirror Descent

arXiv:2604.10848v1

Sequence modelling requires determining which past tokens in the context are causally relevant and how important each one is. This process is inherent to the attention layers in transformers, yet the mechanisms those layers learn remain poorly understood. In this work, we formalize the task of estimating token importance as an in-context learning problem by