Tips: remember to use -np 1 with llama-server as a single user
r/LocalLLaMA
By default, llama-server (llama.cpp) may allocate 4x the context size so it can serve multiple clients. If you are a single user on a system with little VRAM, you know how it goes: bigger context length -> less of the LM fits in VRAM -> reduced speed. So launch llama-server with -np 1, and maybe add --fit-target 126. On my 12GB GPU with 60k context I got ~20% more TPS.

One more: if you use Firefox (or another browser), disable hardware acceleration, since the browser otherwise takes some VRAM for itself. In Firefox, go to Settings > General > Performance, uncheck "Use recommended performance settings", uncheck "Use hardware acceleration when available", then restart Firefox.
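For the llama-server tip, here is a minimal sketch of what a single-user launch line could look like. The model path is a hypothetical placeholder and the 60000 context just mirrors the 60k setup mentioned above; adjust both for your system. It only assembles and prints the command so you can check it before running.

```shell
# Hypothetical model path (an assumption): point this at your own GGUF file.
MODEL="./models/model.gguf"
# Context size in tokens, mirroring the ~60k context from the post.
CTX=60000

# -m   model file
# -c   context size
# -np  number of parallel slots; 1 = single user
CMD="llama-server -m $MODEL -c $CTX -np 1"

# Print the command for review; paste it into your terminal to start the server.
echo "$CMD"
```

If you also want to try --fit-target as suggested in the post, append it to the same line after checking llama-server --help on your build, since available options differ between versions.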