DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)

A few weeks ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing. A small draft model generates 16 tokens in parallel via block diffusion, the target verifies them in one forward pass. Every emitted token is verified against the target model before being committed. Lossless. Stock MLX, no fork. Setup: M5 Max, 64GB, MLX 0.31.1. Baseline is stock mlx_lm.stream_generate, not a custom loop. 3 runs, median reported, 10s cooldown.