Running a 35B Model Locally with TurboQuant — What’s Actually Possible Right Now

Towards AI

Before diving in, one important distinction: TurboQuant does not quantize model weights. It compresses the KV cache at inference time. This means it doesn’t replace weight-quantization formats like GGUF or AWQ; it stacks on top of them. To understand why that matters, you need to understand what Q4_K_M actually is and why “35B” doesn’t mean the same thing across different models.

First: what does Q4_K_M even mean?

When you download a model from Hugging Face or Ollama, you’ll see filenames like qwen2.5-32b-instruct-Q4_K_M.gguf. GGUF is the file format used by llama.cpp and most local inference tools.
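To make the weights-versus-KV-cache distinction concrete, here is a rough back-of-the-envelope sketch of the two memory budgets. Everything numeric in it is an assumption for illustration: the parameter count, the roughly 4.85 bits per weight for Q4_K_M, the layer and head shapes, and the 4-bit cache width are placeholders, not figures from TurboQuant or from any specific model card. The function names are hypothetical helpers, not part of any library.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights (what GGUF/AWQ shrink)."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bits_per_value: float) -> float:
    """Approximate size of the KV cache (what a KV-cache compressor shrinks).

    The leading 2 accounts for one key tensor and one value tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bits_per_value / 8


# Assumed, illustrative shapes for a ~32B model with grouped-query attention.
N_PARAMS = 32.5e9      # assumption
BITS_PER_WEIGHT = 4.85  # assumption: rough average for a Q4_K_M file
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128  # assumption
CONTEXT = 32_768       # tokens held in the cache

weights = weight_bytes(N_PARAMS, BITS_PER_WEIGHT)
kv_fp16 = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 16)
kv_4bit = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT, 4)  # hypothetical width

print(f"weights : {weights / GIB:.1f} GiB")
print(f"KV fp16 : {kv_fp16 / GIB:.1f} GiB")
print(f"KV 4-bit: {kv_4bit / GIB:.1f} GiB")
```

Under these assumed shapes, the 16-bit cache at 32,768 tokens comes to 8 GiB and a 4-bit cache to 2 GiB, while the weight file size does not change at all. That is the sense in which the two techniques stack: one shrinks the fixed cost of loading the model, the other shrinks the cost that grows with context length.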