Qwen 3.6 27B IQ4_XS - 22 t/s on RTX 5060 Ti 16GB, 24k ctx
r/LocalLLaMA
Maybe this will be helpful for someone:

llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000

I can't run this model with higher K-quants at a context size above 8192. With -ub and -b set to 256, the most I could reach was 16384 ctx. The largest context I managed to get is 24k. Disabling GNOME freed up an additional 300 MiB.

It's kind of nice, but I know that a context this small is of limited use in many cases. Without a quantized context (KV cache), this GPU loads 63 of 65 layers with this quant. But it's still q4, so I think that's good enough.

I used the Unsloth quant.

submitted by /u/BazzyIm
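For anyone wondering why the q4_0 cache matters so much for context size, here is a rough back-of-the-envelope sketch in Python. It assumes a plain transformer where every layer keeps a full K and V cache, and the layer count, KV head count and head size in the example are made-up placeholders, not the real Qwen 3.6 27B values (check the GGUF metadata for those). The function name `kv_cache_mib` is also just something I picked for illustration. The bits-per-element figures come from the GGML block layouts: q4_0 packs 32 values into 18 bytes, q8_0 packs 32 values into 34 bytes.

```python
# Rough KV cache size estimate. All model dimensions passed in are assumptions
# supplied by the caller; nothing here reads a real GGUF file.

# Bits stored per cached element for each cache type.
# q4_0: 18 bytes per 32 values = 4.5 bits; q8_0: 34 bytes per 32 values = 8.5 bits.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_mib(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate KV cache size in MiB, assuming every layer caches K and V."""
    # Factor 2 = one K tensor plus one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / (1024 ** 2)


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=24000)
    for cache_type in ("f16", "q8_0", "q4_0"):
        print(cache_type, round(kv_cache_mib(cache_type=cache_type, **dims), 1), "MiB")
```

With placeholder numbers like these, the q4_0 cache comes out at 4.5/16 of the f16 size, which is the kind of saving that decides whether a longer context fits next to the weights on a 16GB card. Models with hybrid or sliding-window attention will need less than this estimate.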