Gemma4 26b & E4B are crazy good, and replaced Qwen for me!
r/LocalLLaMA
•
Generative AI
Open Source AI
AI Research
My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where needed: Qwen 3.5 30b A3B Q8XL - For general chat, basic document tasks, web search, anything huge context that didn't require reasoning. It's also hardcoded to use this model when my latest query contains "quick" Qwen 3.5 27b Q8XL - used as a "higher precision" model to sit in for A3B, especially when reasoning was needed.