Qwen 3.6-35B-A3B on dual 5060 Ti with --cpu-moe: 21.7 tok/s at 90K context, with benchmarks vs dense 3.5 and Coder variant
r/LocalLLaMA
•
Generative AI
Open Source AI
Qwen 3.6 dropped yesterday and I wanted to see if hybrid offloading actually earns its keep on this hardware. My box is two RTX 5060 Ti (32GB VRAM total) with 64GB system RAM. Not a workstation card in sight. I ran the same bench harness across three configs back to back so the comparison is at least fair on the hardware side. Stock ghcr.io/ggml-org/llama.cpp:server-cuda13 for the MoE runs, our TurboQuant build for the dense. Sequential: 10 iterations, 128 max tokens, 2 warmup. Stress: 4 concurrent workers, 256 max tokens, 5 min. Prompt is the same for all.