DFlash speculative decoding on Apple Silicon : 85 tok/s, 3.3x on Qwen3.5-9B (MLX, M5 Max)

I'm building a native MLX implementation of DFlash ( paper ) for Apple Silicon. A small draft model generates 16 tokens in parallel via block diffusion, the target verifies them in one forward pass. Output is bit-for-bit identical to baseline (greedy exact argmax match). Setup: M5 Max, 64GB, MLX, no CUDA.