Running Qwen3.5 / Qwen3.6 with NextN MTP (Multi-Token Prediction) speculative decode in llama.cpp — single RTX 3090 Ti GPU guide

r/LocalLLaMA

I was asked for this guide, so here it is. There is some overlap with someone else's post from yesterday. YMMV! I was too busy with work to write it myself, so I asked Opus to write it for me (I have validated the content!). I'm sure there will be debate over using Q4 quants and so on; I'm happy with how they work with my models, and I'm happy to create higher-precision quants as far as my hardware allows, if asked!

NextN MTP gives ~2.9× faster decode on the Qwen3.5/3.6 family vs vanilla decoding, with zero quality loss (the MTP head ships with the model, and the main model still verifies every drafted token). A heavy MoE architecture like 35B-A3B hits ~150 tok/s on a 3090 Ti.
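For anyone wondering why the "zero quality loss" claim can hold, here is a toy Python sketch of greedy speculative decoding. It is not llama.cpp code: `target_next` and `draft_next` are made-up stand-ins for the main model and the MTP draft head, and the token arithmetic is arbitrary. The point it tries to show is that only tokens the main model itself would have picked are ever emitted, so the output matches plain decoding while needing fewer main-model passes.

```python
# Toy illustration of greedy speculative decoding (NOT llama.cpp code).
# target_next / draft_next are hypothetical stand-ins for the main model
# and the MTP draft head; the arithmetic is arbitrary.

def target_next(seq):
    # Deterministic "main model": next token from last token and length.
    return (seq[-1] * 7 + len(seq)) % 11

def draft_next(seq):
    # Cheap "draft head": agrees with the target except when len(seq) % 4 == 0.
    guess = target_next(seq)
    return (guess + 1) % 11 if len(seq) % 4 == 0 else guess

def plain_decode(prompt, n_new):
    # Vanilla decoding: one main-model pass per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, n_draft=3):
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0  # counts verification passes of the "main model"
    while len(seq) < goal:
        # 1. Draft n_draft tokens cheaply, each conditioned on the previous drafts.
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(seq + draft))
        # 2. Verify the draft. A real implementation checks all drafted
        #    positions in one batched forward pass; here it is a loop.
        passes += 1
        for tok in draft:
            correct = target_next(seq)
            seq.append(correct)  # always emit the main model's own token
            if tok != correct or len(seq) >= goal:
                break            # a miss invalidates the rest of the draft
    return seq, passes

if __name__ == "__main__":
    out, passes = speculative_decode([1, 2], 12)
    print(out == plain_decode([1, 2], 12), passes)
```

In this toy the speedup comes entirely from how often the draft is accepted; real numbers depend on the model, the prompt, and the sampling settings.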