Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

arXiv:2603.08743v1 Announce Type: cross With reasoning becoming the generative paradigm for large language models (LLMs), the memory bottleneck caused by the KV cache during the decoding phase has become a critical factor limiting high-concurrency serving. Although existing KV cache eviction methods address the memory issue, most are impractical for industrial-grade applications. This paper