Hosting Multiple Models
r/LocalLLaMA
I often find myself wanting to host a larger, more capable model alongside a smaller, faster one for simpler stuff. This has been a bit annoying with llama.cpp / vLLM / SGLang because I have to manage multiple endpoints, and they have no auth and limited observability. So I ended up putting together a gateway (LLM Gateway) that sits in front of my multiple instances of these tools and aggregates them into one router with auth and Langfuse integration. I'm curious how others handle this, or whether most people just don't mind managing multiple unauthenticated endpoints.
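To make the idea concrete, here is a minimal sketch of the routing and auth check such a gateway performs. It is not the actual LLM Gateway code: the model names, ports, API key, and the `route` / `is_authorized` helpers are all made up for illustration, and it assumes each backend exposes an OpenAI-compatible `/v1/chat/completions` endpoint.

```python
import hmac

# Hypothetical upstreams: model name -> base URL of one backend instance.
UPSTREAMS = {
    "large": "http://127.0.0.1:8001",  # e.g. the bigger model
    "small": "http://127.0.0.1:8002",  # e.g. the faster model
}

# Hypothetical API keys accepted by the gateway.
API_KEYS = {"sk-local-example"}


def is_authorized(headers: dict) -> bool:
    """Accept only 'Authorization: Bearer <key>' with a known key."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(
        hmac.compare_digest(token.encode(), key.encode()) for key in API_KEYS
    )


def route(headers: dict, body: dict) -> tuple[int, str]:
    """Return (status, upstream URL or error message) for a chat request."""
    if not is_authorized(headers):
        return 401, "invalid or missing API key"
    base = UPSTREAMS.get(body.get("model", ""))
    if base is None:
        return 404, "unknown model"
    return 200, base + "/v1/chat/completions"
```

A client then talks to a single endpoint and picks the backend via the `model` field, e.g. `route({"Authorization": "Bearer sk-local-example"}, {"model": "small"})`. Forwarding the request and logging it to a tracing tool would hang off the same decision point.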