Gwen3.5-27b 8 bit vs 16 bit, 10 runs

The Aider benchmark on Qwen3.5-27b with the four combinations of model weights at bf16, fp8 and KV cache at bf16 and fp8. Each benchmark was repeated 10 times. The variance observed is not statistical significant. FAQ: Why not do 100 runs? Each run is 1+ hours and I have other projects. The variance is already too little and even if we did observe some small thing with a lot of runs, it might not actually mean anything. Why the Aider benchmark? It sucks! Maybe - but I am researching for the specific purpose of agentic coding and I find the benchmark easy to use.