Luce DFlash + PFlash on AMD Strix Halo: Qwen3.6-27B at 2.23x decode and 3.05x prefill vs llama.cpp HIP

Hey fellow Llamas, keeping it short. We just shipped DFlash and PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, Strix Halo, 128 GiB unified memory). It is the same Luce DFlash stack from the RTX 3090 post a couple of weeks back, now running on the consumer AMD APU class.

Repo: (MIT)

TL;DR

End-to-end on Qwen3.6-27B Q4_K_M with the Luce Q8_0 DFlash drafter: 26.85 tok/s decode and 20.2 s prefill at 16K context. That is 2.23x faster decode and 3.05x faster prefill than llama.cpp HIP on the same silicon.
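For anyone who wants to sanity-check the ratios, here is a minimal sketch that back-computes the llama.cpp HIP baseline implied by the numbers above. These are derived values, not separately measured ones, and the prefill throughput assumes "16K" means 16 * 1024 tokens, which the post does not state. All function and constant names are hypothetical.

```python
# Figures quoted in the post (Qwen3.6-27B Q4_K_M, 16K context, Strix Halo).
DECODE_TOK_S = 26.85     # DFlash decode throughput, tokens per second
PREFILL_S = 20.2         # PFlash prefill wall time, seconds
DECODE_SPEEDUP = 2.23    # decode speedup vs llama.cpp HIP
PREFILL_SPEEDUP = 3.05   # prefill speedup vs llama.cpp HIP

# Assumption: "16K" is 16 * 1024 tokens; the post does not say exactly.
CONTEXT_TOKENS = 16 * 1024


def implied_baseline_decode_tok_s() -> float:
    """Baseline decode rate implied by the quoted speedup (throughput ratio)."""
    return DECODE_TOK_S / DECODE_SPEEDUP


def implied_baseline_prefill_s() -> float:
    """Baseline prefill time implied by the quoted speedup (time ratio)."""
    return PREFILL_S * PREFILL_SPEEDUP


def prefill_tok_s(seconds: float, tokens: int = CONTEXT_TOKENS) -> float:
    """Prefill throughput for a given wall time over the assumed prompt length."""
    return tokens / seconds


if __name__ == "__main__":
    print(f"implied baseline decode:  {implied_baseline_decode_tok_s():.2f} tok/s")
    print(f"implied baseline prefill: {implied_baseline_prefill_s():.1f} s")
    print(f"PFlash prefill rate:      {prefill_tok_s(PREFILL_S):.0f} tok/s")
    print(f"baseline prefill rate:    {prefill_tok_s(implied_baseline_prefill_s()):.0f} tok/s")
```

Note that the decode speedup is a ratio of throughputs while the prefill speedup is a ratio of wall times, so the baseline is recovered by dividing in one case and multiplying in the other.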