AI RESEARCH
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
arXiv CS.LG
arXiv:2605.17613v1 Announce Type: cross
The large size of the KV cache has become a major bottleneck for serving LLMs as context lengths increase. In response, many KV cache compression methods, such as token dropping and quantization, have been proposed. However, almost all of these methods are inherently lossy. Accuracy degradation is minimal for short outputs, but as more tokens are decoded the generated text increasingly diverges from what the full KV cache would produce, which leads to catastrophic failures in code generation and tool calling.