I got 3× faster HFQ4 prefill on Strix Halo in hipfire with an opt-in MMQ path
r/LocalLLaMA
I recently contributed an experimental HFQ4-G256 MMQ prefill path to hipfire, an RDNA-focused LLM inference engine. Disclaimer: I authored the PR, so this is partly a contribution note, but I am mainly looking for independent validation from other AMD users.

Before this PR, HFQ4 prefill in hipfire went through a generic, slower path. On my Strix Halo system, prompt processing was clearly the bottleneck: longer prefills ran at roughly 310-340 tok/s. The new path is an opt-in MMQ-style prefill implementation.