[Help] Running big dense models faster
r/LocalLLaMA
I have been trying Mistral Medium 3.5 (the 128B model, UD-Q4_K_XL quant) on my 4x RTX 3090 rig with llama.cpp. Inference is slow (about 11 t/s), even though nothing is offloaded to the CPU. Here is the llama-server command I used:

./llama-server --model ./downloaded_models/Mistral-Medium-3.5-128B-UD-Q4_K_XL-00001-of-00003.gguf --port 11433 --host 0.0.0.0 --temp 0.7 --jinja -fa on --chat-template-kwargs '{"reasoning_effort":"none"}'

llama.cpp automatically set the context window to about 44000 tokens so that the whole computation fits on the GPUs.
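
For context on the 11 t/s figure, here is a rough back-of-the-envelope sketch of the memory-bandwidth ceiling for single-stream decoding of a dense model. It rests on assumptions that are not in the post: roughly 4.8 bits per weight for this quant (a guess; the real file size may differ), 936 GB/s memory bandwidth per RTX 3090 (the spec-sheet value), and GPUs that take turns layer by layer, so only one card's bandwidth counts at a time. The function and constant names are made up for illustration.

```python
# Bandwidth-limited ceiling on decode speed for a dense model.
# Idea: each generated token has to read every weight once, so
# tokens/s <= usable memory bandwidth / model size in bytes.

def max_tokens_per_second(n_params, bits_per_weight,
                          bandwidth_bytes_per_s, gpus_in_parallel=1):
    """Return the bandwidth-limited upper bound in tokens per second."""
    model_bytes = n_params * bits_per_weight / 8
    return gpus_in_parallel * bandwidth_bytes_per_s / model_bytes

N_PARAMS = 128e9            # from the "128B" in the model file name
BITS_PER_WEIGHT = 4.8       # ASSUMPTION: typical for a Q4_K-style quant
RTX_3090_BANDWIDTH = 936e9  # bytes/s, spec-sheet value for one RTX 3090

# GPUs working one after another: one card's bandwidth at a time.
sequential = max_tokens_per_second(N_PARAMS, BITS_PER_WEIGHT,
                                   RTX_3090_BANDWIDTH)

# Idealized case where all four cards read their share simultaneously
# (ignores inter-GPU communication, so real numbers would be lower).
ideal_parallel = max_tokens_per_second(N_PARAMS, BITS_PER_WEIGHT,
                                       RTX_3090_BANDWIDTH,
                                       gpus_in_parallel=4)

print(f"sequential ceiling:     {sequential:.1f} t/s")
print(f"ideal 4-way parallel:   {ideal_parallel:.1f} t/s")
```

Under these assumptions the sequential ceiling comes out at about 12 t/s, so 11 t/s may already be close to what one card's bandwidth allows. If that holds, any real speedup would have to come from getting the GPUs to work in parallel rather than from tuning flags on the current setup.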