Is using vLLM actually worth it if you aren't serving the model to other people?
r/LocalLLaMA
•
Generative AI
AI Hardware
Open Source AI
AI Tools
So, as most of us here are, I'm a llama.cpp loyalist. Easy to understand, great configuration, relatively stable, etc. But I’ve been increasingly tempted by vLLM, especially since AMD just added it as a built-in inference engine to Lemonade, and I happen to have an AMD GPU. The thing is, I've never actually used vLLM directly, but I've heard good things about how it performs compared to llama.cpp, with vLLM apparently outperforming it pretty much across the board.