Qwen3.5-397B at 17-19 tok/s on a Strix Halo iGPU — all 61 layers on GPU via Vulkan (not ROCm)

r/LocalLLaMA

Running Qwen3.5-397B-A17B (IQ2_XXS, 107GB, 4 GGUF shards) at 17-19 tok/s generation and **25-33 tok/s prompt processing** on a single AMD Ryzen AI Max+ 395 with 128GB unified memory. All 61 layers offloaded to the integrated Radeon 8060S GPU. Total hardware cost: ~$2,500.

The setup:

- AMD Ryzen AI Max+ 395 (Strix Halo), Radeon 8060S (gfx1151, RDNA 3.5, 40 CUs)
- 128GB LPDDR5X unified memory
- llama.cpp built with **Vulkan** (Mesa RADV 24.2.8), NOT ROCm/HIP
- Ubuntu, kernel 6.17

The key finding: use Vulkan, not ROCm.
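For anyone sanity-checking the numbers above, here is a rough back-of-envelope sketch in Python. It uses only the figures quoted in the post and rests on several assumptions that are mine, not measurements: "107GB" is taken as 107e9 bytes, the "A17B" suffix is read as roughly 17B active parameters per token, the quantization is treated as uniform across all tensors (real IQ2_XXS GGUFs mix quant types), and KV-cache and activation traffic are ignored. Under those assumptions it works out to roughly 2.2 bits per weight and something like 78-87 GB/s of weight reads during generation.

```python
# Back-of-envelope estimate from the figures quoted in the post.
# Assumptions (not measured): decimal gigabytes, uniform quantization,
# ~17B active parameters per token, no KV-cache or activation traffic.

MODEL_BYTES = 107e9        # "107GB" across the 4 GGUF shards
TOTAL_PARAMS = 397e9       # 397B total parameters
ACTIVE_PARAMS = 17e9       # "A17B": assumed active parameters per token
GEN_TOK_S = (17.0, 19.0)   # reported generation speed range


def bits_per_weight(model_bytes: float, total_params: float) -> float:
    """Average bits stored per parameter, assuming uniform quantization."""
    return model_bytes * 8 / total_params


def bytes_per_token(model_bytes: float, total_params: float,
                    active_params: float) -> float:
    """Approximate weight bytes read to generate one token."""
    return model_bytes * active_params / total_params


def effective_bandwidth_gbs(bytes_tok: float, tok_s: float) -> float:
    """Implied weight-read bandwidth in GB/s at a given generation speed."""
    return bytes_tok * tok_s / 1e9


bpw = bits_per_weight(MODEL_BYTES, TOTAL_PARAMS)
per_token = bytes_per_token(MODEL_BYTES, TOTAL_PARAMS, ACTIVE_PARAMS)
bw_low, bw_high = (effective_bandwidth_gbs(per_token, s) for s in GEN_TOK_S)

if __name__ == "__main__":
    print(f"average bits/weight:      {bpw:.2f}")
    print(f"weight bytes per token:   {per_token / 1e9:.2f} GB")
    print(f"implied read bandwidth:   {bw_low:.0f}-{bw_high:.0f} GB/s")
```

Treat the result as an order-of-magnitude check only. It says nothing about why Vulkan beats ROCm here, just how much weight data the reported generation speed implies is being streamed per second.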