TurboQuant seems to work very well on Gemma 4 — and separately, per-layer outlier-aware K quantization is beating current public fork results on Qwen PPL

r/LocalLLaMA

I’ve been experimenting with TurboQuant KV cache quantization in llama.cpp (CPU + Metal) on Gemma 4 26B A4B-it Q4_K_M on an Apple M4 Pro 48GB, and the results look surprisingly strong.

Gemma 4 findings

On Gemma 4, QJL seems to work well, and FWHT as a structured rotation substitute also looks like a good fit for the large attention heads (dk=256/512