How we almost wrote off 3 models as broken — the thinking-mode tax

By Vilius Vystartas | May 2026

Three models scored 15% or lower in my first benchmark run. Kimi K2.5: 10%. MiniMax M2.5: 15%. Gemma 4: HTTP 400 on every call. I almost excluded them as broken.

They weren't broken. I was calling them wrong. Here's what happened and how to avoid it when benchmarking your own models.

The symptoms

Kimi K2.5 (10%): Every response was empty. The model returned exactly 300 tokens of nothing. The finish_reason was length: it ran out of token budget before producing any visible output.
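The Kimi symptom (empty output plus finish_reason: length) is easy to check for automatically before scoring a run. Here is a minimal sketch, assuming the provider returns an OpenAI-style chat completion body with choices[0].message.content, choices[0].finish_reason, and usage.completion_tokens. The function name diagnose_response and the label strings are hypothetical, made up for illustration.

```python
from typing import Any


def diagnose_response(response: dict[str, Any]) -> str:
    """Classify one chat-completion response for benchmark triage.

    Assumes an OpenAI-style response body; other providers may differ.
    """
    choice = response["choices"][0]
    message = choice.get("message") or {}
    content = (message.get("content") or "").strip()
    finish_reason = choice.get("finish_reason")
    completion_tokens = (response.get("usage") or {}).get("completion_tokens", 0)

    if content:
        return "ok"
    if finish_reason == "length" and completion_tokens > 0:
        # Tokens were generated, but none reached the visible answer:
        # the budget ran out before any output was produced.
        return "budget_exhausted_before_output"
    return "empty_for_other_reason"


# Hypothetical response mirroring the Kimi K2.5 symptom described above.
kimi_like = {
    "choices": [{"message": {"content": ""}, "finish_reason": "length"}],
    "usage": {"completion_tokens": 300},
}
print(diagnose_response(kimi_like))
```

Counting how many responses in a run land in the budget-exhausted bucket separates "the model answered wrong" from "the model never got to answer", which is the distinction that mattered here.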