RX 9070 (RDNA4/gfx1201) ROCm 7.2.1 llama.cpp Benchmarks — The Flash Attention Discovery

r/LocalLLaMA

**Hardware:** AMD Ryzen 9 9900X | RX 9070 16GB VRAM (RDNA 4, gfx1201) | 192GB DDR5 | Ubuntu 24.04

**ROCm version:** 7.2.1

**llama.cpp build:** ROCm with `-DGGML_CUDA_FORCE_MMQ=ON -DGGML_HIP_GRAPHS=ON`

---

## TL;DR

ROCm 7.2.1 on the RX 9070 (RDNA 4) beats Vulkan on prompt processing once you enable flash attention and the right build flags. Token generation still favors Vulkan on MoE models. The default ROCm build is catastrophically slow: flash attention alone gives a 5.5× improvement on prompt processing for dense models.
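For anyone who wants to try this, here is a rough sketch of what the build and an A/B flash attention run might look like. Only the two `-DGGML_...` flags named above come from my setup notes; everything else is an assumption based on the standard llama.cpp HIP build instructions, so check it against your checkout. In particular, `-DGGML_HIP=ON` and `-DAMDGPU_TARGETS=gfx1201` are the usual HIP options (newer trees may spell the target variable `GPU_TARGETS`), the model path is a placeholder, and the `-p 512 -n 128` sizes are just llama-bench's common defaults, not necessarily the ones behind the numbers in this post. The script only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/usr/bin/env bash
# Sketch: ROCm (HIP) build of llama.cpp plus a flash-attention A/B benchmark.
# Run from the root of a llama.cpp checkout. Prints commands by default.
set -euo pipefail

MODEL="${MODEL:-models/your-model.gguf}"  # hypothetical path, replace with a real GGUF
DRY_RUN="${DRY_RUN:-1}"                   # set DRY_RUN=0 to actually execute

# Echo the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Configure: the last two flags are the ones from the post; the rest are assumed.
run cmake -S . -B build \
  -DGGML_HIP=ON \
  -DAMDGPU_TARGETS=gfx1201 \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DGGML_HIP_GRAPHS=ON

run cmake --build build --config Release -j 16

# llama-bench accepts comma-separated values, so "-fa 0,1" benchmarks
# flash attention off and on in a single invocation.
run ./build/bin/llama-bench -m "$MODEL" -ngl 99 -fa 0,1 -p 512 -n 128
```

Comparing the `pp512` rows for `fa = 0` and `fa = 1` in the llama-bench output is the quickest way to see whether you get the same kind of prompt processing jump on your card.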