Transformers learn variable-order Markov chains in-context

arXiv:2410.05493v2 Announce Type: replace We study transformers' in-context learning of variable-order Markov chains (VOMCs), focusing on finite-sample accuracy as the number of in-context examples increases. Compared to fixed-order Markov chains (FOMCs), learning VOMCs is substantially more challenging because of the additional structural learning component. The problem is naturally suited to a Bayesian formulation, where the context-tree weighting (CTW) algorithm, originally developed in the information theory community for universal data compression, provides an optimal solution.
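
To make the CTW reference concrete, here is a minimal sketch of context-tree weighting for a binary alphabet. It is an illustration under stated assumptions, not the construction studied in the paper: it assumes the Krichevsky-Trofimov (KT) estimator at each node, a mixing weight of 1/2 at every internal node, and that the first `depth` symbols serve only as initial context. The function names (`kt_log_prob`, `ctw_log_prob`, `ctw_predict_one`) are hypothetical.

```python
import math
from collections import defaultdict


def kt_log_prob(a, b):
    """Log KT-estimator probability of a sequence with a zeros and b ones."""
    return (math.lgamma(a + 0.5) + math.lgamma(b + 0.5)
            - math.log(math.pi) - math.lgamma(a + b + 1))


def _log_add(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))


def ctw_log_prob(seq, depth):
    """Log CTW probability of seq[depth:], given seq[:depth] as context.

    Assumption: binary symbols (0 or 1) and weight 1/2 at internal nodes.
    """
    # counts[ctx] = [zeros, ones] observed right after context ctx,
    # where ctx lists the preceding symbols, most recent first.
    counts = defaultdict(lambda: [0, 0])
    for t in range(depth, len(seq)):
        for d in range(depth + 1):
            ctx = tuple(seq[t - d:t][::-1])
            counts[ctx][seq[t]] += 1

    def weighted(ctx):
        if ctx not in counts:
            return 0.0  # a node that saw no data has probability 1
        a, b = counts[ctx]
        log_pe = kt_log_prob(a, b)
        if len(ctx) == depth:
            return log_pe  # leaf: only the local KT estimate
        # Internal node: mix "stop here" with "split on one older symbol".
        log_split = weighted(ctx + (0,)) + weighted(ctx + (1,))
        return math.log(0.5) + _log_add(log_pe, log_split)

    return weighted(())


def ctw_predict_one(seq, depth):
    """Predictive probability that the next symbol is 1."""
    seq = list(seq)
    return math.exp(ctw_log_prob(seq + [1], depth) - ctw_log_prob(seq, depth))
```

For example, `ctw_predict_one([0, 1] * 20, depth=2)` gives a small probability for a 1 after the alternating pattern, because the mixture puts most of its weight on the order-1 tree. This sketch recomputes the whole tree on every call for clarity; practical implementations update the nodes along one context path per symbol.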