Can't replicate Reddit numbers with Qwen 27B on a 3090 Ti.
r/LocalLLaMA
I feel like I'm going insane. I see people here posting 30 to 100+ tok/s (100+ being with speculative decoding) on a 3090 with Qwen 3.6 27B. I'm trying to replicate this, but my numbers are nowhere near that. I have tried llama.cpp with Unsloth's Q4XL and Q4_K_M GGUFs; with those I got about 10 tok/s at 50k context. I also tried ik_llama.cpp with this smaller GGUF: which is about 1 GB smaller than Unsloth's GGUF, and with that combination I get about 18-19 tok/s at 50k context.