I tracked a major cache reuse issue down to Qwen 3.5’s chat template

r/LocalLLaMA

Over the last week, I’ve been investigating cache misses while optimizing local agent workflows on my M5 Max. My setup used oMLX.ai as a backend with agents like OpenCode.ai and Pi.de, but I reproduced the same behavior with other backends like llama.cpp too. At first, I assumed this was an inference engine issue or a cache implementation bug.