AI RESEARCH

Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space

arXiv CS.AI

arXiv:2510.04476v2 Announce Type: replace-cross The quadratic compute and linearly growing KV-cache of Multi-headed Attention (MHA) make long-context transformers expensive to train and serve. Prior work such as Grouped Query Attention (GQA) and Multi-Latent Attention (MLA) shrinks the cache, speeding up decode, but leaves compute, which determines prefill and