AI RESEARCH
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
arXiv cs.LG
arXiv:2605.01708v1 Announce Type: cross Contemporary systems serving large language models (LLMs) have adopted prefill-decode disaggregation to better balance load between the compute-bound prefill phase and the memory-bound decode phase. Under this design, prefill workers generate a KV cache that must be transferred to decode workers before token generation can begin. Because these workers reside on different physical systems, this transfer becomes a significant bottleneck when serving LLMs at scale.
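To give a sense of scale for the transfer described above, the following back-of-envelope sketch estimates the size of one request's KV cache and the time to move it over a link. Everything in it is an assumption for illustration and does not come from the paper: the model shape (32 layers, 8 KV heads, head dimension 128), the 16-bit element size, the 8192-token prompt, the 25 GB/s link bandwidth, and the helper names `kv_cache_bytes` and `transfer_seconds`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Estimate the KV cache size in bytes for a single sequence.

    The leading factor of 2 counts the separate key and value tensors
    stored for every layer. All parameters are hypothetical inputs.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(num_bytes, link_bytes_per_sec):
    """Idealized transfer time: payload size divided by link bandwidth.

    Ignores latency, protocol overhead, and contention, so real
    transfers would take longer.
    """
    return num_bytes / link_bytes_per_sec


# Hypothetical model shape and prompt length (assumptions, not from the paper).
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

# Hypothetical 25 GB/s link between a prefill worker and a decode worker.
seconds = transfer_seconds(size, link_bytes_per_sec=25e9)

print(f"KV cache: {size / 2**30:.2f} GiB, idealized transfer: {seconds * 1e3:.1f} ms")
```

Under these assumed numbers a single long prompt already produces about 1 GiB of KV cache, and that cost is paid once per request before decoding can start. Lossless compression of the cache shrinks the payload in `transfer_seconds` without changing the values the decode worker receives.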