Qwen 3.6 27B on 24GB VRAM setup: backend comparisons, quant choice and settings (llama.cpp, ik_llama.cpp, BeeLlama, vllm)
r/LocalLLaMA
TL;DR

Best setup I tested on an RTX 3090 24 GB:

- ik_llama.cpp + Qwen3.6-27B-MTP-IQ4_KS.gguf
- 156k context, q8_0/q8_0 KV, MTP, vision on CPU
- benchmark result on a ~5.9k prompt + 1k output: about 1261 tok/s prefill, 72.9 tok/s decode
- llama.cpp was a good start, BeeLlama is worth testing, but ik_llama.cpp performed the best

What was tested

- upstream llama.cpp: easy baseline and a good place to start
- beellama.cpp: promising on paper, but I could not reproduce the expected speed on my setup
- ik_llama.cpp: best decode/prefill, best VRAM fit

I also spent time with vLLM / club-3090, but I am leaving it.
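To put the benchmark throughput numbers in wall-clock terms, here is a rough back-of-the-envelope sketch. It assumes the "~5.9k prompt" is exactly 5900 tokens and the "1k output" exactly 1000 tokens, and that prefill and decode speeds stay constant over the whole request. Real runs will differ (MTP acceptance rate, context length, sampling overhead), and the function name `estimate_latency` is just an illustrative helper, not part of any backend.

```python
def estimate_latency(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate request time from token counts and throughput (tokens/sec).

    Assumes constant throughput in both phases, which is a simplification.
    Returns (prefill_seconds, decode_seconds, total_seconds).
    """
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s, decode_s, prefill_s + decode_s


# Figures from the benchmark above; token counts are rounded assumptions.
prefill_s, decode_s, total_s = estimate_latency(
    prompt_tokens=5900,
    output_tokens=1000,
    prefill_tps=1261.0,
    decode_tps=72.9,
)

print(f"prefill: {prefill_s:.1f} s")
print(f"decode:  {decode_s:.1f} s")
print(f"total:   {total_s:.1f} s")
```

Under those assumptions the prompt is processed in under 5 seconds and almost three quarters of the total time goes to decode, which is why decode tok/s matters most for this kind of workload.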