A3: an Analytical Low-Rank Approximation Framework for Attention

arXiv:2505.12942v4 Announce Type: replace-cross Large language models have demonstrated remarkable performance; however, their massive parameter counts make deployment highly expensive. Low-rank approximation offers a promising compression solution, yet existing approaches have two main limitations: (1) they focus on minimizing the output error of individual linear layers, without considering the architectural characteristics of Transformers, and (2) they decompose a large weight matrix into two small low-rank matrices.
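
To make the second limitation concrete, the following is a minimal sketch of the generic decomposition used by existing approaches: a single weight matrix is replaced by two smaller low-rank factors. It uses a plain truncated SVD in NumPy and is not the A3 method itself. The function name `low_rank_factors`, the layer shape (64 x 48), and the rank of 8 are illustrative assumptions, not values from the paper.

```python
import numpy as np


def low_rank_factors(W, rank):
    """Split W (out x in) into A (out x rank) and B (rank x in).

    A @ B is the truncated-SVD approximation of W, which is the best
    rank-`rank` approximation of W in the Frobenius norm.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale the kept left singular vectors
    B = Vt[:rank, :]            # kept right singular vectors
    return A, B


# Hypothetical layer shape and rank, chosen only for illustration.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
A, B = low_rank_factors(W, rank=8)

x = rng.standard_normal(48)
y_full = W @ x       # original layer: one matrix-vector product
y_low = A @ (B @ x)  # compressed layer: two smaller products

params_full = W.size          # 64 * 48 parameters
params_low = A.size + B.size  # 8 * (64 + 48) parameters
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
```

The parameter count drops from `out * in` to `rank * (out + in)`, but the compressed layer now performs two matrix products where the original performed one, and the error being minimized concerns this single weight matrix in isolation.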