For the 5 people here running vLLM on multiple R9700s, you need to patch in support for AITER Unified Attention.
r/LocalLLaMA
I have a 4 x R9700 system on a Threadripper Pro, but I have never been happy with the performance of my GPUs in vLLM. I have started benchmarking every new model I try out with llama-benchy so that I can get a better idea of how models of different sizes and architectures compare on my system. With every model I have tested, I run into a wall at around 64k tokens of context: TTFT, TG and PP all fall on their face at long context lengths. So this past weekend I rented an MI300X from RunPod, thinking that AMD must have this issue sorted on.