AI RESEARCH
KVSculpt: KV Cache Compression as Distillation
arXiv CS.AI
arXiv:2603.27819v1 Announce Type: cross

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction, which selects which KV pairs to keep, to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries.
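To make the two ends of that spectrum concrete, here is a minimal toy sketch in pure Python. It is not the paper's method or any library's API: the function names `evict` and `merge_adjacent`, the per-pair importance `scores`, and the choice to merge the most cosine-similar adjacent keys by plain averaging are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
KVPair = Tuple[Vector, Vector]  # (key, value) for one cached token


def evict(cache: Sequence[KVPair], scores: Sequence[float], budget: int) -> List[KVPair]:
    """Eviction: keep the `budget` highest-scoring pairs, in original order.

    `scores` is a hypothetical per-pair importance (for example, accumulated
    attention weight); how it is computed is outside this sketch.
    """
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_adjacent(cache: Sequence[KVPair], budget: int) -> List[KVPair]:
    """Merging: repeatedly average the two adjacent pairs with the most
    similar keys until at most `budget` pairs remain (assumed toy rule)."""
    merged = [(list(k), list(v)) for k, v in cache]
    while len(merged) > max(budget, 1):
        # Index of the adjacent pair whose keys are most similar.
        i = max(range(len(merged) - 1),
                key=lambda j: cosine(merged[j][0], merged[j + 1][0]))
        (k1, v1), (k2, v2) = merged[i], merged[i + 1]
        avg_key = [(a + b) / 2 for a, b in zip(k1, k2)]
        avg_value = [(a + b) / 2 for a, b in zip(v1, v2)]
        merged[i:i + 2] = [(avg_key, avg_value)]
    return merged


toy_cache = [
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.5], [3.0, 3.0]),
    ([0.0, 1.0], [5.0, 5.0]),
]
toy_scores = [0.5, 0.1, 0.9]

print(evict(toy_cache, toy_scores, budget=2))
print(merge_adjacent(toy_cache, budget=2))
```

In both cases the compressed cache is built directly from the original entries: eviction returns a subset of them, and merging returns averages of them. This is the sense in which both families stay anchored to the original cache.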