FastDMS: 6.4X KV-cache compression running faster than vLLM BF16/FP8
r/LocalLLaMA
Last year, researchers affiliated with NVIDIA, the University of Warsaw, and the University of Edinburgh published Dynamic Memory Sparsification (DMS), a KV-cache sparsification technique that uses learned per-head token eviction and reports up to 8x KV-cache compression. I found the results intriguing enough to build a small reference implementation and trainer to sanity-check the idea. On WikiText-2 with Llama 3.2 1B, I was able to get a rough replication:

| Configuration | PPL | Delta | KLD (nats/tok) | Compression |
|---|---|---|---|---|
| Vanilla Llama-3.2-1B | 9.226 | - | - | 1x |
| DMS (trained, eviction active) | 9.200 | -0.28% | 0.026 | 6.4x |
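For anyone wanting to check the replication numbers above, here is a minimal sketch of how the three reported metrics could be computed. The function names are hypothetical and not taken from FastDMS or the DMS paper, and I am assuming that Delta is the relative perplexity change versus the vanilla model, that KLD is KL(vanilla || DMS) averaged over token positions, and that compression is total KV entries divided by retained entries.

```python
import math

def ppl_delta_percent(baseline_ppl, new_ppl):
    """Relative perplexity change versus the baseline, in percent."""
    return 100.0 * (new_ppl - baseline_ppl) / baseline_ppl

def compression_ratio(total_kv_entries, retained_kv_entries):
    """Assumed definition: KV entries without eviction / entries kept."""
    return total_kv_entries / retained_kv_entries

def mean_kl_nats(ref_probs, test_probs, eps=1e-12):
    """Mean KL(ref || test) per token position, in nats.

    Each argument is a list of per-position next-token probability
    distributions (lists of floats summing to 1). The direction of the
    divergence is an assumption, not confirmed by the post.
    """
    total = 0.0
    for p_row, q_row in zip(ref_probs, test_probs):
        total += sum(
            p * math.log((p + eps) / (q + eps))
            for p, q in zip(p_row, q_row)
            if p > 0.0  # terms with p == 0 contribute nothing
        )
    return total / len(ref_probs)

# Perplexities from the table: vanilla 9.226, DMS 9.200.
print(round(ppl_delta_percent(9.226, 9.200), 2))  # -0.28
```

The negative delta means the DMS run scored slightly lower perplexity than the vanilla baseline on this eval, which is why the per-token KLD is the more informative drift measure here.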