The LLM tunes its own llama.cpp flags (+54% tok/s on Qwen3.5-27B)
r/LocalLLaMA
This is V2 of my previous post. What's new: --ai-tune. The model tunes its own flags in a loop and caches the fastest config it finds.

My weird rig: 3090 Ti + 4070 + 3060 + 128GB RAM.

| Model | llama-server | llm-server v1 tuning | llm-server v2 (ai-tuning) |
|---|---|---|---|
| Qwen3.5-122B | 4.1 tok/s | 11.2 tok/s | 17.47 tok/s |
| Qwen3.5-27B Q4_K_M | 18.5 tok/s | 25.94 tok/s | 40.05 tok/s |
| gemma-4-31B UD-Q4_K_XL | 14.2 tok/s | 23.17 tok/s | 24.77 tok/s |

What I think is best here: --ai-tune keeps up with updates to llama.cpp / ik_llama.cpp automatically, because it feeds the output of llama-server --help into the LLM tuning loop as context.
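For anyone wondering what a loop like this could look like, here is a rough sketch of the idea. It is not the actual llm-server code. The names `ai_tune`, `propose`, and `benchmark` are made up for illustration: `propose` stands in for "ask the LLM for the next flag set, given the --help text and the results so far", and `benchmark` stands in for "launch the server with those flags and measure tok/s". Both are passed in as plain functions so the loop itself stays testable.

```python
import json
from pathlib import Path
from typing import Callable


def ai_tune(
    help_text: str,
    propose: Callable[[str, list], list],   # hypothetical: LLM suggests the next flags
    benchmark: Callable[[list], float],     # hypothetical: runs the server, returns tok/s
    cache_path: Path,
    rounds: int = 5,
) -> dict:
    """Try `rounds` flag sets and cache the fastest one as JSON."""
    # Reuse a previously cached winner instead of tuning again.
    if cache_path.exists():
        return json.loads(cache_path.read_text())

    history = []  # every attempt is fed back to `propose` as context
    best = None
    for _ in range(rounds):
        flags = propose(help_text, history)
        try:
            tok_s = benchmark(flags)
        except Exception:
            tok_s = 0.0  # treat a crash or bad flag as the worst possible result
        attempt = {"flags": flags, "tok_s": tok_s}
        history.append(attempt)
        if best is None or tok_s > best["tok_s"]:
            best = attempt

    cache_path.write_text(json.dumps(best))
    return best
```

Because the help text is captured at tuning time and handed to `propose`, any flag that a newer build adds shows up in the context without the loop needing to know about it. A real version would also need to invalidate the cache when the model, the hardware, or the binary changes. This sketch skips that.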