80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP

r/LocalLLaMA

Just wanted to share my config in the hope of helping other 12GB GPU owners reach what I consider very respectable token generation speeds on modest VRAM. Using the latest llama.cpp build plus the MTP PR, I got over 80 tok/sec on the benchmark found here: This is on an RTX 4070 Super, so results on other cards may vary.

To run llama.cpp with MTP, you need to build it from source and apply a draft PR that hasn't yet been merged into the master branch.
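For anyone who hasn't built from a PR branch before, here is a rough sketch of the steps as a small Python helper that only prints the commands so you can review them first. It assumes the usual GitHub `pull/<N>/head` ref and a CUDA build via `-DGGML_CUDA=ON`. The PR number is deliberately a parameter you have to supply yourself, and the local branch name `mtp-pr` is just a hypothetical label, not something from the PR.

```python
import shlex

# Upstream llama.cpp repository on GitHub.
LLAMA_CPP_REPO = "https://github.com/ggml-org/llama.cpp"


def build_commands(pr_number: int, branch: str = "mtp-pr") -> list[list[str]]:
    """Return the commands to clone llama.cpp, check out a PR, and build it.

    pr_number: number of the draft MTP PR (look it up yourself; not guessed here).
    branch:    hypothetical local branch name used to hold the PR head.
    """
    if pr_number <= 0:
        raise ValueError("pr_number must be a positive integer")
    return [
        ["git", "clone", LLAMA_CPP_REPO],
        # GitHub exposes every pull request head under refs/pull/<N>/head.
        ["git", "-C", "llama.cpp", "fetch", "origin", f"pull/{pr_number}/head:{branch}"],
        ["git", "-C", "llama.cpp", "checkout", branch],
        # Configure with the CUDA backend enabled (assumes the CUDA toolkit is installed).
        ["cmake", "-S", "llama.cpp", "-B", "llama.cpp/build", "-DGGML_CUDA=ON"],
        ["cmake", "--build", "llama.cpp/build", "--config", "Release", "-j"],
    ]


def render(commands: list[list[str]]) -> str:
    """Join the commands into a shell-pasteable string, one command per line."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Call `print(render(build_commands(N)))` with `N` set to the MTP PR number and paste the output into a terminal. Since the PR is still a draft, the branch can change or be rebased at any time, so expect to re-fetch and rebuild occasionally.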