Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max

r/LocalLLaMA

Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-k-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K; I wanted to see what the curves look like once you push well past that.

Hardware: MacBook Pro M5 Max, 128 GB unified memory. Built the fork with cmake -B build -DGGML_METAL=ON. Ran llama-bench with 3 reps per cell, flash-attn on, mlock on, 8 hours wall-clock overnight. Cache types: f16, q8_0, turbo3, turbo4. Symmetric K and V (-ctk and -ctv set to the same type).
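For anyone who wants to reproduce the grid, here is a rough dry-run sketch of how a sweep like this could be scripted. It only prints the llama-bench command for each cell and does not run anything. The model path, the depth list, and the ./build/bin/llama-bench location are placeholders I made up for illustration, not the values used for these runs. The flags (-ctk, -ctv, -fa, -r, -d) are the ones I know from upstream llama-bench; check llama-bench --help on the fork before trusting them. mlock is left out because the post does not give the exact flag.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints one llama-bench command per (cache type, depth) cell.
# MODEL and DEPTHS are hypothetical placeholders, not the values from the post.
MODEL="${MODEL:-models/model-q8_0.gguf}"
DEPTHS="${DEPTHS:-0 32768 131072}"
# turbo3 and turbo4 are cache types from the TurboQuant fork, per the post.
CACHE_TYPES="f16 q8_0 turbo3 turbo4"

gen_sweep() {
  for ct in $CACHE_TYPES; do
    for d in $DEPTHS; do
      # Symmetric K and V: the same type goes to both -ctk and -ctv.
      # -fa 1 turns flash attention on, -r 3 asks for 3 repetitions,
      # -d sets the context depth to prefill before measuring.
      echo "./build/bin/llama-bench -m $MODEL -ctk $ct -ctv $ct -fa 1 -r 3 -d $d"
    done
  done
}

gen_sweep
```

Once the printed lines look right, piping the output to bash would actually run them. Keeping the generator separate from execution makes it easy to eyeball the grid before committing to an overnight run.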