AI RESEARCH
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
arXiv CS.AI
arXiv:2605.13734v1 Announce Type: cross LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns the KV cache into an explicit payload that crosses network and storage boundaries, making KV transfer a dominant end-to-end bottleneck. Existing KV compression schemes are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets.
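
To give a rough sense of why the KV payload can dominate once it has to cross a network link, the following back-of-the-envelope sketch estimates KV cache size and transfer time for a single request. It is not taken from the paper: the model shape (32 layers, 8 KV heads, head dimension 128, fp16), the 10 Gbps link, and the 4x compression ratio are all hypothetical values chosen for illustration, and the helper names `kv_cache_bytes` and `transfer_seconds` are made up here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the KV cache for one sequence, in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, bandwidth_gbps, compression_ratio=1.0):
    """Idealized time to move num_bytes over a link of bandwidth_gbps gigabits/s.

    compression_ratio is original size divided by compressed size.
    Ignores latency, protocol overhead, and compression/decompression cost.
    """
    bits_on_wire = num_bytes * 8 / compression_ratio
    return bits_on_wire / (bandwidth_gbps * 1e9)


# Hypothetical model shape and context length, for illustration only.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

uncompressed = transfer_seconds(size, bandwidth_gbps=10)
compressed = transfer_seconds(size, bandwidth_gbps=10, compression_ratio=4.0)

print(f"KV cache size: {size / 2**30:.1f} GiB")
print(f"Transfer at 10 Gbps, no compression: {uncompressed:.2f} s")
print(f"Transfer at 10 Gbps, 4x compression: {compressed:.2f} s")
```

Under these assumed numbers the transfer time scales linearly with both bandwidth and compression ratio, which is why a compression setting fixed at deployment time may fit poorly when the available bandwidth or the workload mix changes.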