I found why KV cache INT4 breaks on some models (Qwen2-7B: ΔPPL +238) and built a 4-line fix no training, no calibration, 12 models tested up to 40B

r/LocalLLaMA
Open Source AI

I've been playing with KV cache INT4 quantization and noticed something weird: it works perfectly on some models and completely destroys others. Examples: Falcon-40B: ΔPPL +0.08 ✅ (basically free compression) OPT-13B: ΔPPL +0.28 ✅ Qwen2-7B: ΔPPL +238 ❌ (output becomes incoherent garbage) Pythia-6.9B: ΔPPL +22 ❌ Pythia-410M: ΔPPL +77 ❌ Same quantization method. Why does it break on some and not others? Root cause: two independent problems Token-wise norm variation - KV vector norms fluctuate 2-5x across tokens in Pre-LN models.