TurboQuant on MLX: 4.6x KV cache compression with custom Metal kernels (Qwen 32B at 98% FP16 speed)

r/LocalLLaMA
Open Source AI AI Research

Implemented TurboQuant (Google's new KV cache compression paper) for MLX with fused Metal kernels. Results on Qwen2.5-32B, M4 Pro 48GB: - 4.6x compression, 0.98x FP16 speed, identical quality - 16K context: 4.2GB cache → 897MB The main challenge was speed - went from 0.28x to 0.98x FP16 through fused Metal quantize/dequantize kernels and an incremental decode buffer. Writeup with the full optimization journey: Code: PR to mlx-lm: submitted by /u/dirtyhand3 [link] [comments.