RTX 5080 16GB: Qwen3.6 35B MoE at 128k context — 56 tok/s, and why MTP doesn't help


MTP (Multi-Token Prediction) just merged into mainline llama.cpp at b9190, and I promised u/WarthogConfident4039 a Qwen3.6 benchmarking round. I tested three configs at real coding-agent context lengths (not just 512 tokens). The main finding surprised me.

TL;DR: 35B Q4_K_XL, no MTP, --fit-target 1536, 131k context. That's the config. It gets 56 tok/s generation and 1,584 tok/s prompt processing at 128k context. MTP doesn't help at 128k: the MTP and non-MTP runs converge to the same speed. Skip the complexity.
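
For a feel of what those two numbers mean in wall-clock terms, here is a rough back-of-envelope sketch. It assumes "131k context" means 131,072 tokens (128 × 1024), that the quoted rates hold as averages over the whole run, and it uses a made-up 1,000-token reply; none of those assumptions come from the benchmark itself.

```python
def seconds_for(tokens: int, tokens_per_second: float) -> float:
    """Time to process `tokens` at a constant average rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second


# Rates quoted in the post.
PROMPT_TOK_S = 1584.0  # prompt processing at 128k context
GEN_TOK_S = 56.0       # generation at 128k context

# Assumptions, not measurements:
CONTEXT_TOKENS = 128 * 1024  # "131k context" read as 131,072 tokens
REPLY_TOKENS = 1_000         # hypothetical reply length

prefill_s = seconds_for(CONTEXT_TOKENS, PROMPT_TOK_S)
decode_s = seconds_for(REPLY_TOKENS, GEN_TOK_S)

print(f"cold prefill of {CONTEXT_TOKENS} tokens: {prefill_s:.1f} s")
print(f"generating {REPLY_TOKENS} tokens: {decode_s:.1f} s")
```

This treats the prefill as fully cold. An agent loop that reuses a cached prompt prefix would only pay for the new tokens each turn, so real per-turn latency is likely lower than the cold figure.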