GGUF (llama.cpp) vs MLX Round 2: Your feedback tested, two models, five runtimes. Ollama adds overhead. My conclusion. Thoughts?

Two weeks ago I posted here that MLX was slower than GGUF on my M1 Max. You gave feedback, pointed out I picked possibly the worst model for MLX. Broken prompt caching ( mlx-lm ), hybrid attention MLX can't optimize, bf16 on a chip that doesn't do bf16. So I went and tested almost all of your hints and recommendations. Two mature models (Gemma 12B QAT, Qwen3 30B-A3B), five runtimes, and the bf16→fp16 fix u/bakawolf123 suggested for M1/M2 chips. Also compiled llama.cpp from source to check if LM Studio adds overhead. Same M1 Max 64GB.