Offloading LLM matrix multiplication to the AMD XDNA2 NPU on Ryzen AI MAX 385: 43.7 t/s decode at 0.947 J/tok

Built a custom llama.cpp backend that dispatches GEMM ops directly to the XDNA2 NPU on Ryzen AI MAX 385 (Strix Halo). No iGPU and no shared memory contention.

Model: Meta-Llama-3.1-8B-Instruct Q4_K_M

Hardware: Ryzen AI MAX 385, CachyOS 6.19, amdxdna driver, XRT 2.21.75

Results

| Backend | Prefill (t/s, pp512) | Decode (t/s, tg64) | Avg Power | J/tok |
|---|---|---|---|---|
| Vulkan prefill + NPU decode | 930 | 43.7 | 41.5 W | 0.947 |
| Vulkan only | 833 | 41.6 | 52.2 W | 1.3 |
| CPU only | 4.6 | 3.76 | - | - |

The NPU decode path draws ~10 W less than Vulkan-only (41.5 W vs 52.2 W) while matching, and slightly beating, decode throughput, since the iGPU sits idle during decode and stays free for other work.
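For anyone wanting to sanity-check the J/tok column, here is a minimal sketch. It assumes energy per token is simply average power divided by decode throughput; the post doesn't say exactly how J/tok was measured, so treat the formula and the helper name `joules_per_token` as my assumptions, not the actual measurement method.

```python
def joules_per_token(avg_power_w: float, tokens_per_s: float) -> float:
    """Energy per generated token: watts (J/s) divided by tokens per second."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return avg_power_w / tokens_per_s


# (average power in W, decode throughput in t/s), taken from the results above
runs = {
    "Vulkan prefill + NPU decode": (41.5, 43.7),
    "Vulkan only": (52.2, 41.6),
}

for name, (watts, tps) in runs.items():
    print(f"{name}: {joules_per_token(watts, tps):.2f} J/tok")
```

This gives roughly 0.95 and 1.25 J/tok, close to the 0.947 and 1.3 reported; the small gaps are presumably rounding in the posted power and throughput numbers.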