Benchmarking the new b9200 update: Optimizing Qwen 3.6 27B MTP for Hermes Agent on a single RTX 3090
r/LocalLLaMA
UPDATED (post b9200): Okay, here's the updated version. I'm using the Qwen 3.6 27B MTP GGUF from Unsloth as the backend for the Hermes Agent. While dialing it in, I noticed that the currently recommended Unsloth MTP flags actually bottleneck performance and tank draft acceptance rates for strict, multi-turn agentic workflows. Pairing a custom config with today's brand-new llama.cpp b9200 release, which specifically fixes MTP memory traffic overhead, completely turns that around.
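For anyone new to the term, "draft acceptance rate" here means the share of speculatively drafted tokens that the main model ends up keeping. A minimal sketch of that ratio, assuming you have already pulled the two counts out of your server logs yourself; the function name and the numbers are hypothetical, not llama.cpp output or measurements from this post:

```python
def draft_acceptance_rate(accepted: int, drafted: int) -> float:
    """Return the fraction of drafted tokens that were accepted.

    Both counts are assumed to come from your own log parsing;
    this helper is illustrative and not part of llama.cpp.
    """
    if accepted < 0 or drafted < 0:
        raise ValueError("counts must be non-negative")
    if accepted > drafted:
        raise ValueError("accepted cannot exceed drafted")
    if drafted == 0:
        # Nothing was drafted, so report 0.0 instead of dividing by zero.
        return 0.0
    return accepted / drafted


# Hypothetical counts, for illustration only.
rate = draft_acceptance_rate(accepted=75, drafted=100)
print(f"draft acceptance: {rate:.0%}")
```

A low ratio means most drafted tokens get thrown away, so the drafting work is mostly wasted, which is why the flags matter so much for multi-turn agent runs.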