16 GB VRAM users, what model do we like best now?
r/LocalLLaMA
I'm finding Qwen 3.5 27b at IQ3 quants to be quite nice. I can usually fit around 32k context without issues (that's usually enough for me, since I don't use my local models for anything like coding) and get around 40+ t/s on my RTX 4080 using ik_llama.cpp compiled for CUDA. I'm wondering if we could maybe get away with IQ4 quants for the Gemma 26b MoE using turboquant for the K cache. Being on 16 GB kind of feels like edging, because the quality drop-off between IQ4 and Q4 feels pretty noticeable to me, but you also give up a ton of speed as soon as you need to start offloading layers.
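
For anyone who wants to sanity-check a quant/context combo before downloading, here's a rough back-of-envelope sketch of the arithmetic. Everything numeric in it is a placeholder I made up for illustration: the layer count, KV head count, head dimension, bits per weight, and the flat overhead are not the real specs of any model mentioned above, so swap in the values from your GGUF metadata. It also assumes plain full attention in every layer and ignores the small per-block scale overhead of quantized caches.

```python
GIB = 2**30

def weights_gib(n_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / GIB

def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_bits=16, v_bits=16):
    """Approximate KV cache size in GiB; K and V can use different precisions."""
    bits_per_token = n_layers * n_kv_heads * head_dim * (k_bits + v_bits)
    return bits_per_token * n_ctx / 8 / GIB

def fits(vram_gib, weights, kv, overhead_gib=1.0):
    """True if weights + KV cache + a flat guess for buffers fit in VRAM."""
    return weights + kv + overhead_gib <= vram_gib

# Hypothetical 27B dense model at an assumed ~3.5 bits per weight.
w = weights_gib(27, 3.5)

# Hypothetical architecture: 48 layers, 4 KV heads, head dim 128, 32k context.
kv_f16 = kv_cache_gib(48, 4, 128, 32768)            # f16 K and V
kv_k4 = kv_cache_gib(48, 4, 128, 32768, k_bits=4)   # 4-bit K, f16 V

print(f"weights {w:.2f} GiB, KV f16 {kv_f16:.2f} GiB, KV with 4-bit K {kv_k4:.2f} GiB")
print("fits in 16 GiB with f16 cache:", fits(16, w, kv_f16))
```

The point is just that quantizing the K cache shrinks the cache and leaves the weights alone, so whatever it frees up is what you'd have to spend on a bigger quant. Whether that's enough to step up a quant level depends on the real architecture numbers.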