Transformers for Learning on Noisy and Task-Level Manifolds: Approximation and Generalization Insights

ArXi:2505.03205v3 Announce Type: replace Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have nstrated that real-world data and learning tasks exhibit low-dimensional structures, along with some noise or measurement error. The performance of transformers tends to depend on the intrinsic dimension of the data/tasks, though theoretical understandings remain largely unexplored for transformers.