MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison.

Disclaimer: I am fairly new to running local LLMs. But I like to know, measure and build things. So I kept seeing "use MLX on Mac, it's 2x faster" everywhere. Loaded Qwen3.5-35B-A3B to my M1 Max 64GB I bought used. LM Studio, saw 57 tok/s generation vs 29 tok/s for the same GGUF model. Seemed obvious. I expected everything to be snappy. Well. turns out: No. Then I timed actual tasks. GGUF was faster in document classifications and not much faster in multi-turn agent conversations. That sent me down a rabbit hole. That tok/s number only measures generation (tokens produced one at a time.