3.5× KV cache compression with +0.012 PPL (Mistral 7B, no retraining)

r/LocalLLaMA
Generative AI AI Hardware Open Source AI AI Research

3.5× KV cache compression with minimal quality loss (+0.012 PPL on Mistral 7B). PR under review in NVIDIA kvpress - happy to share details or benchmarks. submitted by /u/Spirited-Toe-3988 [link] [comments]