AI RESEARCH

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

arXiv CS.CL

arXiv:2605.16839v1

Chunked prefill has become a widely adopted serving strategy for long-context large language models, but efficient attention computation in this regime remains challenging. Existing sparse attention methods are primarily designed for one-shot prefill and do not translate efficiently to chunked prefill: block-sparse kernels lose efficiency when the query length is limited by the chunk size, while fine-grained pattern search becomes costly when repeated over the accumulated KV cache at every chunk.
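The cost of repeating a search over the accumulated KV cache can be illustrated with simple arithmetic. The sketch below is not the paper's method or cost model. It assumes, for illustration only, that each chunk's pattern search visits every KV position accumulated so far (including the chunk's own), and the function name `kv_positions_scanned` is hypothetical.

```python
def kv_positions_scanned(context_len: int, chunk_size: int) -> int:
    """Total KV positions visited across all chunks of a chunked prefill.

    Assumption (illustrative only): each chunk's search scans the whole
    KV cache accumulated up to and including that chunk.
    """
    if context_len < 0 or chunk_size <= 0:
        raise ValueError("context_len must be >= 0 and chunk_size must be > 0")
    total = 0
    processed = 0
    while processed < context_len:
        # The last chunk may be shorter than chunk_size.
        step = min(chunk_size, context_len - processed)
        processed += step
        # This chunk's search covers all KV entries seen so far.
        total += processed
    return total


if __name__ == "__main__":
    context_len = 131_072
    one_shot = kv_positions_scanned(context_len, context_len)
    for chunk_size in (8_192, 2_048, 512):
        chunked = kv_positions_scanned(context_len, chunk_size)
        print(f"chunk={chunk_size}: {chunked / one_shot:.1f}x the one-shot scan")
```

Under this assumption, the total scanned grows roughly in inverse proportion to the chunk size: halving the chunk size roughly doubles the repeated search work over the same context.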