Tesla P40 running qwen 3.6

r/LocalLLaMA
Generative AI Open Source AI

Does anyone know why qwen 3.6 MTP spec decoding won't work with Tesla P40 when the K cache is quantized? I was able to get mtp qwen 3.6 27B Q5 running at 20t/s on my tesla p40. But only after removing any quantization of the K cache (running at F16). I had no trouble running turbo3 k cache without MTP on the turboquant fork of llama.cpp, but using the atomic fork to get MTP working it would only give garbage output characters with any kind of q4_0, turbo3 on K cache. Anyone know what's up with that? Here's my powershell start script $en:TERM = "xterm-256color" $Host.