[llama.cpp] New TurboQuant 3-bit KV Cache is insane! 17 t/s on Nemotron 30B using only 8GB VRAM (Full Windows/MSVC Build Guide + Auto-Script)

r/LocalLLaMA

Hi everyone. If you are running a GPU with 8GB of VRAM (like my laptop RTX 4070), you’ve probably accepted that 30B+ models are "too slow" because they spill out of VRAM into system memory. Not anymore. I’ve successfully compiled and tested the brand-new TurboQuant algorithm (released March 2026) on Windows. It uses matrix rotations to compress the KV cache to 3 bits per value with almost zero quality loss. The result is mind-blowing: Nemotron-Cascade-2 30B runs at 17.04 tokens/sec with an 8k context window on a mobile card.
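To give a feel for why the bit width of the KV cache matters on an 8GB card, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are hypothetical placeholders, not the real Nemotron-Cascade-2 30B configuration, and the helper `kv_cache_bytes` is my own illustration rather than anything from llama.cpp. It also ignores quantization metadata (per-block scales and similar), so real numbers will be somewhat higher.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bits_per_value):
    """Approximate KV cache size in bytes, ignoring quantization metadata."""
    # Two tensors (K and V) per attention layer, and one head_dim-sized
    # vector per KV head for every token in the context window.
    n_values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return n_values * bits_per_value // 8


# Hypothetical model shape (assumption, NOT the actual Nemotron config).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)

print(f"16-bit KV cache: {fp16 / 2**20:.0f} MiB")  # 1024 MiB
print(f"3-bit KV cache:  {q3 / 2**20:.0f} MiB")    # 192 MiB
print(f"reduction:       {fp16 / q3:.2f}x")        # 5.33x
```

With these placeholder numbers the cache drops from 1 GiB to under 200 MiB at 8k context, which is VRAM you get back for model layers. Plug in the values from your own model's metadata to get a realistic estimate.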