Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge

r/LocalLLaMA

Just finished a 3-way head-to-head. Sharing the raw results because this sub has been good about poking holes in methodology, and I'd rather get that feedback than pretend my setup is perfect.

Setup

- 30 questions, 6 per category (code, reasoning, analysis, communication, meta-alignment)
- All three models answer the same question blind - no system prompt differences, same temperature
- Claude Opus 4.6 judges each response independently on a 0-10 scale with a structured rubric (not "which is better," but absolute scoring per response)
- Single judge, no swap-and-average this run - I know that.
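
For anyone who wants to replicate something like this, here is a minimal sketch of how the protocol could be wired up. It is not the actual harness used for this run. `answer_fn` and `judge_fn` are hypothetical placeholders for whatever calls your local models and the judge, and keeping the model name away from the judge is an assumption about how the blind part might be done.

```python
from collections import defaultdict
from statistics import mean

# Categories as listed in the setup; 6 questions each gives 30 total.
CATEGORIES = ["code", "reasoning", "analysis", "communication", "meta-alignment"]


def build_schedule(questions_per_category=6):
    """Return (category, question_index) pairs covering every question."""
    return [(cat, i) for cat in CATEGORIES for i in range(questions_per_category)]


def run_eval(models, schedule, answer_fn, judge_fn):
    """Score each response on its own (absolute 0-10), not pairwise.

    answer_fn(model, category, index) -> response text   (hypothetical)
    judge_fn(category, response) -> numeric score 0-10   (hypothetical)
    The judge never receives the model name (assumed blinding scheme).
    """
    records = []
    for cat, idx in schedule:
        for model in models:
            response = answer_fn(model, cat, idx)
            score = judge_fn(cat, response)
            if not 0 <= score <= 10:
                raise ValueError(f"score {score} outside the 0-10 rubric range")
            records.append((model, cat, score))
    return records


def summarize(records):
    """Return (overall mean per model, mean per (model, category))."""
    by_model = defaultdict(list)
    by_model_cat = defaultdict(list)
    for model, cat, score in records:
        by_model[model].append(score)
        by_model_cat[(model, cat)].append(score)
    overall = {m: mean(s) for m, s in by_model.items()}
    per_category = {k: mean(s) for k, s in by_model_cat.items()}
    return overall, per_category
```

Plug in your own `answer_fn` and `judge_fn`, pass the three model names to `run_eval`, and `summarize` gives the overall and per-category means. A swap-and-average or multi-judge variant would only need a different `judge_fn`.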