Qwen 3.6 27B MTP on V100 32GB: 54 t/s

r/LocalLLaMA

Just a quick note that I got a nice result using am17an's MTP branch of llama.cpp on a V100 32GB SXM module mounted on one of those PCIe adapter cards. It pulled and built in one shot, and llama-server ran without a hitch. Tested with am17an's MTP GGUF, a q8_0 K cache, and a 200k cache limit, acting as the backend for VS Code Copilot. I get 29-30 t/s without MTP and 54-55 t/s with MTP, with the card power-limited to 150W. It falls to 40-45 t/s after choking down 50k tokens, but it's doing great with tool calls and sub-agents, and it made some very insightful code reviews and refactors.
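
For anyone who wants the numbers as ratios, here is a rough back-of-the-envelope sketch in Python. It only uses the figures quoted above; taking the midpoint of each range is my own simplification, and the joules-per-token figure assumes the card sits right at the 150W cap while generating, so treat it as an upper bound rather than a measurement. Note also that the long-context ratio compares against the short-context non-MTP baseline, since no non-MTP number at 50k tokens was measured.

```python
def midpoint(low, high):
    """Midpoint of a reported throughput range (a simplification, not a measurement)."""
    return (low + high) / 2


def speedup(baseline_tps, new_tps):
    """Ratio of new throughput to baseline throughput."""
    return new_tps / baseline_tps


# Figures quoted in the post, in tokens per second.
baseline_tps = midpoint(29, 30)   # without MTP
mtp_tps = midpoint(54, 55)        # with MTP, short context
mtp_long_tps = midpoint(40, 45)   # with MTP, after ~50k tokens of context

# Assumption: the card draws the full power limit during generation.
power_limit_w = 150

fresh_speedup = speedup(baseline_tps, mtp_tps)
long_ctx_ratio = speedup(baseline_tps, mtp_long_tps)
joules_per_token_max = power_limit_w / mtp_tps

print(f"MTP speedup at short context: {fresh_speedup:.2f}x")
print(f"MTP at ~50k context vs short-context baseline: {long_ctx_ratio:.2f}x")
print(f"Upper bound on energy per token with MTP: {joules_per_token_max:.2f} J")
```

So MTP works out to a bit under a 2x gain at short context, and it still sits comfortably above the non-MTP baseline once the context fills up.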