At wit's end optimizing llama.cpp settings for 100k context
r/LocalLLaMA
Long story short, I am running Qwen3.5-35B-A3B (GGUF format) and other models with the latest version of llama.cpp on macOS, and getting around 1500 tokens/sec for prompt processing and around 35-50 tokens/sec for token generation. The problem I'm having is that I'm spending more time trying to optimize settings than running inference.