TurboQuant + TriAttention (C/HIP): ~6.8× total KV cache reduction in llama.cpp

r/LocalLLaMA
Generative AI AI Hardware Open Source AI

Results from combining two KV-cache reduction methods in llama.cpp on AMD/HIP: TurboQuant KV cache compression (turbo3): ~5.1× reduction TriAttention KV cache pruning (75% retention): ~1.33× reduction Combined: ~6.8× total KV reduction At 131K context: f16 KV = 8.2 GiB → combo ≈ 1.2 GiB. TurboQuant numbers (Qwen3.5-27B, RX 7900 XTX): - GSM8K: 72.0% on 1319 problems (vs 66% f16) - NIAH: 28/28 up to 64K context - Tool calling: 26/26 - PPL: +0.02% at 4K, -0.9% at 16K - Speed overhead: ~1-2% TriAttention is based on the recent NVIDIA/MIT paper ( arXi:2604.04921.