Delta-KV for llama.cpp: near-lossless 4-bit KV cache on Llama 70B
r/LocalLLaMA
I applied video compression to LLM inference and got 10,000x less quantization error at the same storage cost

I've been experimenting with KV cache compression in LLM inference, and I ended up borrowing an idea from video codecs: don't store every frame in full; store a keyframe, then deltas. Turns out this works surprisingly well for LLMs too.

The idea

During autoregressive decoding, consecutive tokens produce very similar KV cache values. So instead of quantizing the absolute KV values to 4-bit, I quantize the difference between consecutive tokens.
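To make the keyframe-plus-deltas idea concrete, here is a minimal NumPy sketch. It is not the llama.cpp patch. The synthetic "KV" data (a random walk with small steps), the per-vector symmetric 4-bit quantizer, and the function names are all assumptions made for illustration. One further assumption: each delta is taken against the previously reconstructed vector rather than the original one, so quantization error does not pile up across tokens.

```python
import numpy as np

def quantize_4bit(x):
    """Symmetric 4-bit quantization with one scale per vector (illustrative scheme)."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

def absolute_roundtrip(kv):
    """Baseline: quantize every token's vector independently."""
    return np.stack([dequantize(*quantize_4bit(row)) for row in kv])

def delta_roundtrip(kv):
    """Token 0 is a full-precision keyframe; later tokens store 4-bit deltas."""
    out = np.empty_like(kv)
    out[0] = kv[0]
    for t in range(1, len(kv)):
        # Assumption: diff against the reconstruction, not the original value.
        delta = kv[t] - out[t - 1]
        out[t] = out[t - 1] + dequantize(*quantize_4bit(delta))
    return out

# Synthetic stand-in for one head's cache: 128 tokens, 64 dims, small drift per token.
rng = np.random.default_rng(0)
base = rng.normal(size=64).astype(np.float32)
steps = (0.01 * rng.normal(size=(127, 64))).astype(np.float32)
kv = np.vstack([base, base + np.cumsum(steps, axis=0)])

mse_abs = float(np.mean((kv - absolute_roundtrip(kv)) ** 2))
mse_delta = float(np.mean((kv - delta_roundtrip(kv)) ** 2))
print(f"absolute 4-bit MSE: {mse_abs:.3e}")
print(f"delta 4-bit MSE:    {mse_delta:.3e}")
```

The delta version wins here because the 4-bit grid only has to cover the small range of the differences instead of the full range of the values. How large the gap is on a real model depends on how similar consecutive KV vectors actually are, so the numbers from this toy data should not be read as a reproduction of the 10,000x figure.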