Qwen 3.6 35B crushes Gemma 4 26B on my tests

I have a personal eval harness: a repo with around 30k lines of code that has 37 intentional issues for LLMs to debug and address through an agentic setup (I use OpenCode). A subset of the harness also has the LLM extract key information from reasonably large PDFs (40-60 pages), then summarize and evaluate its findings.

Long story short, the harness tests the following LLM attributes:

- Agentic capabilities
- Coding
- Image-to-text synthesis
- Instruction following
- Reasoning

Both models ran at UD-Q4_K_XL for a fair baseline, each with optimal sampling params.
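For illustration only, here is a minimal sketch of how results from a harness like this could be tallied per attribute. Everything in it (`IssueResult`, `pass_rates`, the field names) is hypothetical and not taken from the actual harness; it just assumes each seeded issue is tagged with one attribute and ends up either resolved or not.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueResult:
    # All fields are assumptions made for this sketch.
    issue_id: str    # hypothetical identifier for one seeded issue
    attribute: str   # e.g. "coding" or "reasoning"
    resolved: bool   # True if the model's fix passed the check


def pass_rates(results):
    """Return {attribute: fraction of that attribute's issues resolved}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        totals[r.attribute] += 1
        if r.resolved:
            passed[r.attribute] += 1
    return {attr: passed[attr] / totals[attr] for attr in totals}
```

Running `pass_rates` once per model over the same list of issues gives a like-for-like comparison per attribute, rather than a single overall score.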