Qwen 35B-A3B is very usable with 12GB of VRAM
r/LocalLLaMA
Hardware:
- RTX 3060 12GB
- 32GB DDR4-3200
- Windows
- CUDA 13.x

Model: Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf

The model is a 35B MoE, so -ncmoe matters a lot. -ncmoe sets how many layers keep their MoE expert weights on the CPU, so a lower -ncmoe means more MoE blocks stay on the GPU.

Main takeaway

12GB of VRAM feels like a very practical size for this model. It lets you keep enough MoE blocks on the GPU that plain decoding becomes quite strong, while still leaving room for useful context sizes like 16k/32k.

For prompt processing / prefill, I trust the llama-bench numbers more than llama-cli's interactive Prompt: line, because llama-bench gives a cleaner pp512 measurement.
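If you want to reproduce this kind of comparison, one option is to sweep -ncmoe with llama-bench and compare the pp512 and token-generation rows for each setting. The snippet below is only a rough sketch: it builds the command lines and prints them rather than running anything. The -ncmoe values (24, 20, 16), the -ngl 99 "offload everything else" setting, and the 128-token generation length are placeholder assumptions, not the settings used in this post, and it assumes a llama.cpp build recent enough that llama-bench accepts -ncmoe.

```python
import shlex


def build_bench_commands(model_path, ncmoe_values, n_prompt=512, n_gen=128,
                         binary="llama-bench"):
    """Return one llama-bench argument list per -ncmoe value.

    Nothing is executed here; the caller decides how to run the commands.
    """
    commands = []
    for ncmoe in ncmoe_values:
        commands.append([
            binary,
            "-m", model_path,
            "-ngl", "99",             # assumption: offload all non-expert layers
            "-ncmoe", str(ncmoe),     # layers whose MoE experts stay on the CPU
            "-p", str(n_prompt),      # prompt length, 512 gives the pp512 row
            "-n", str(n_gen),         # generation length (placeholder value)
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical sweep values; pick ones that actually fit your VRAM.
    sweep = build_bench_commands("Qwen3.6-35B-A3B-MTP-IQ4_XS.gguf", [24, 20, 16])
    for argv in sweep:
        print(shlex.join(argv))
```

The printed lines use POSIX-style quoting, so on Windows you may need to adjust the binary name (for example llama-bench.exe) and any paths with spaces. Going from high to low -ncmoe until the model no longer fits is a simple way to find the lowest value your card can handle at a given context size.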