Qwen 27B MTP Config, llama.cpp, Single 3090

r/LocalLLaMA

What setup are you using for Qwen 27B on a single 3090? Here's what I started using today. With a 64K context it has to compact often, but I'm worried about giving up accuracy and reliability by dropping to a lower quant:

    llama-server -m /Models/q3.6/Qwen3.6-27B-Q5_K_S.gguf -c 65536 -ngl -1 -t 8 \
      -ctk q8_0 -ctv q8_0 \
      --chat-template-kwargs "{\"preserve_thinking\": true}" \
      --spec-type draft-mtp --spec-draft-n-max 2 --fit off \
      --mmproj /Models/q3.6/mmproj-Qwen3.6-27B-f16.gguf --no-mmproj-offload

I'm getting around 65 tk/s.

I've also seen these recommendations: They seem to be using the Q4 quant.
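One rough way to reason about the quant vs. context trade-off on a 24 GB card is to estimate how much VRAM the KV cache takes at a given `-c`, and how much `-ctk`/`-ctv q8_0` saves compared with f16. The sketch below assumes the standard formula (one K and one V vector per KV head, per attention layer, per token) and llama.cpp's block sizes for `q8_0` (32 values in 34 bytes) and `q4_0` (32 values in 18 bytes). The layer, head, and head-dimension numbers are hypothetical placeholders, not Qwen3.6-27B's real config; read the actual values from the GGUF metadata. It also ignores weights, the MTP draft, compute buffers, and the mmproj (which `--no-mmproj-offload` keeps off the GPU anyway).

```python
# Rough KV-cache size estimate for llama.cpp-style cache types.
# Bytes per element: f16 = 2; q8_0 packs 32 values into a 34-byte block
# (32 int8 values + one fp16 scale); q4_0 packs 32 values into 18 bytes.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}

GIB = 1024 ** 3


def kv_cache_bytes(n_ctx, n_attn_layers, n_kv_heads, head_dim,
                   type_k="f16", type_v="f16"):
    """Estimate KV-cache size in bytes for a single sequence.

    Every attention layer stores one K and one V vector of length head_dim
    per KV head for each token in the context.
    """
    per_token_elems = n_attn_layers * n_kv_heads * head_dim
    k_bytes = n_ctx * per_token_elems * BYTES_PER_ELEM[type_k]
    v_bytes = n_ctx * per_token_elems * BYTES_PER_ELEM[type_v]
    return k_bytes + v_bytes


if __name__ == "__main__":
    # HYPOTHETICAL architecture numbers, not the real Qwen3.6-27B config.
    arch = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
    for ctx in (32768, 65536, 131072):
        f16 = kv_cache_bytes(ctx, **arch) / GIB
        q8 = kv_cache_bytes(ctx, type_k="q8_0", type_v="q8_0", **arch) / GIB
        print(f"ctx={ctx:>6}: f16 {f16:5.2f} GiB, q8_0 {q8:5.2f} GiB")
```

With these placeholder numbers, q8_0 cuts the cache to about 53% of f16 (34/64), so each doubling of `-c` costs roughly half as much extra VRAM. Add the estimate to the GGUF file size of each quant you're considering to see how much context a Q4 file would actually buy over Q5_K_S.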