MTP on Strix Halo with llama.cpp (PR #22673)

r/LocalLLaMA

I saw a post about MTP support coming to llama.cpp, so I tried it out on an AI Max 395 with 128GB DDR5 8000:

I rebuilt the rad container from with that PR:

I ran that GGUF:

and added --spec-type mtp --spec-draft-n-max 3

Result: between 60 and 80 tokens/s, up from 40ish tokens/s without MTP, depending on the subject (some common math stuff seems to be the fastest). In the screenshot I was trying ROCm, but it's about 40-45 tokens/s with Vulkan. Prompt processing (PP) seems unchanged.
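For anyone who wants to put numbers on that, here is a rough sketch in Python. The first function just turns the throughput figures above into a speedup ratio. The second is a toy model that is my assumption, not something taken from the PR: it treats each drafted token as accepted independently with probability `p`, and counts how many tokens one verification pass yields with `n_draft` drafted tokens (3 here, matching --spec-draft-n-max 3). It ignores the cost of drafting and verifying, so it is an upper bound on the speedup, not a prediction.

```python
def speedup_range(baseline_tps, low_tps, high_tps):
    """Return (min, max) speedup of the MTP run relative to the baseline."""
    return low_tps / baseline_tps, high_tps / baseline_tps


def expected_tokens_per_step(p, n_draft):
    """Expected tokens produced per verification pass.

    Toy assumption: each drafted token is accepted independently with
    probability p, and acceptance stops at the first rejection. One token
    is always produced, so the result lies between 1 and n_draft + 1.
    """
    return sum(p ** k for k in range(n_draft + 1))


if __name__ == "__main__":
    # Figures from the post: ~40 tokens/s without MTP, 60 to 80 with it.
    low, high = speedup_range(40, 60, 80)
    print(f"speedup: {low:.2f}x to {high:.2f}x")

    # Hypothetical acceptance rates, chosen only for illustration.
    for p in (0.3, 0.5, 0.7, 0.9):
        tokens = expected_tokens_per_step(p, n_draft=3)
        print(f"p={p:.1f}: {tokens:.2f} tokens per pass")
```

The subject dependence would fit this picture: predictable text (like common math) probably means more accepted draft tokens per pass, though I haven't measured acceptance rates.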