Qwen3.6-35B-A3B - even in VRAM-limited scenarios it can be better to use bigger quants than you'd expect!
r/LocalLLaMA
Maybe this is a no-brainer to many experienced local LLM users, but it was not obvious to me. I am running a 3070 (8 GB) with 64 GB of DDR4. That's a pretty lightweight setup, so I chose the smallest Q4 unsloth model, Qwen3.6-35B-A3B-UD-IQ4_XS.gguf, which is ~18 GB. It runs OK, and with some optimizations in llama.cpp I got about 25-30 tokens/s with a 32k context window.

I did have some problems with looping during thinking, so I tried a bigger Q4 model, Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf, which is ~23 GB. To my surprise, this is much faster! With a 128k context window, I am seeing 32 tokens/s.
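For anyone who wants to sanity-check the sizes, here is a rough back-of-the-envelope sketch, not a benchmark. It assumes 1 GB = 1e9 bytes, takes the 35B total parameter count from the model name, and uses the approximate file sizes above. The helper names are made up for illustration. It ignores the KV cache and compute buffers, so the "outside VRAM" figure is only a lower bound.

```python
def bits_per_weight(file_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return file_gb * 8 / params_billion


def min_spill_gb(file_gb: float, vram_gb: float) -> float:
    """Lower bound on weights that cannot fit in VRAM (ignores KV cache and buffers)."""
    return max(0.0, file_gb - vram_gb)


VRAM_GB = 8          # 3070 from the post
TOTAL_PARAMS_B = 35  # assumption: taken from the "35B" in the model name

# Approximate file sizes in GB, as quoted in the post.
QUANTS = {"UD-IQ4_XS": 18, "UD-Q4_K_XL": 23}

for name, size_gb in QUANTS.items():
    bpw = bits_per_weight(size_gb, TOTAL_PARAMS_B)
    spill = min_spill_gb(size_gb, VRAM_GB)
    print(f"{name}: ~{bpw:.2f} bits/weight, at least {spill:.0f} GB outside VRAM")
```

Either way, most of the weights end up in system RAM on an 8 GB card, so the smaller file does not get you anywhere near a "fits in VRAM" situation. This does not explain the speed difference by itself; it only shows that both quants are in the same offloading regime.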