RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context: the --n-cpu-moe flag is the most important part
r/LocalLLaMA
Spent an evening dialing in Qwen3.6-35B-A3B on consumer hardware. Fun side note: I had Claude Opus 4.7 (just the $20 sub) build the config, launch the servers in the background, run the benchmarks, read the VRAM splits from the llama.cpp logs, and iterate on the tuning. It basically did the whole thing autonomously; I just told it what hardware I have and what I wanted to run. Sharing because the common --cpu-moe advice leaves 54% of your speed on the table on 16GB GPUs.
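For anyone who hasn't used the two flags: as I understand it, --cpu-moe keeps all MoE expert weights in system RAM, while --n-cpu-moe N keeps only the experts of the first N layers there, so the remaining experts can live in VRAM. Here is a rough Python sketch of how the two launch commands differ. The model filename and the value 20 are placeholders for illustration, not the settings from this run, and `llama_server_args` is just a hypothetical helper.

```python
import shlex


def llama_server_args(model_path, ctx_size=131072, n_cpu_moe=None):
    """Build a llama-server argument list (illustrative helper, not part of llama.cpp).

    n_cpu_moe=None -> --cpu-moe       (all MoE expert weights stay on the CPU)
    n_cpu_moe=N    -> --n-cpu-moe N   (only the first N layers' experts stay on the CPU)
    """
    # -c sets the context size (131072 tokens = 128K); -ngl 99 offloads all layers to the GPU.
    args = ["llama-server", "-m", model_path, "-c", str(ctx_size), "-ngl", "99"]
    if n_cpu_moe is None:
        args.append("--cpu-moe")
    else:
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    return args


MODEL = "Qwen3.6-35B-A3B.gguf"  # placeholder filename, not the actual file used

baseline = llama_server_args(MODEL)             # the common --cpu-moe advice
tuned = llama_server_args(MODEL, n_cpu_moe=20)  # 20 is a placeholder; tune it for your VRAM

print(shlex.join(baseline))
print(shlex.join(tuned))
```

The idea is to lower N until the VRAM split in the llama.cpp logs is close to full without running out of memory, then benchmark each step.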