Painfully slow local llama on 5090 and 192GB RAM

r/LocalLLaMA
Generative AI AI Hardware Open Source AI

I am running a llama server with the following command: nohup./llama-server \ --model "/path/to/your/models/MiniMax-M2.5-UD-Q3_K_XL.gguf" \ --alias "minimax_m2.5" \ --threads $(nproc) \ --threads-batch $(nproc) \ --n-gpu-layers -1 \ --port 8001 \ --ctx-size 65536 \ -b 4096 -ub 4096 \ --temp 1.0 \ --top-p 0.95 \ --min-p 0.01 \ --top-k 40 \ > llama-server.log 2>&1 & and then ollama launch claude --model frob/minimax-m2.5 i wait than 10 minutes for the first answer to come back when I give it a first prompt, subsequent prompts remain similarly slow.