GLM 5.1 crushes every other model except Opus in our agentic benchmark, at about 1/3 of the Opus cost

r/LocalLLaMA

I wanted to know whether GLM is just another benchmark-optimized model or actually useful in agents like OpenClaw, so I tested GLM 5.1 in our agentic benchmark. In my tests it reaches Opus 4.6-level performance at about 1/3 of the cost (~$0.4 per run vs. ~$1.2 per run) and outperforms all other models tested, which pushes the cost-effectiveness frontier quite a bit. I don't really trust static benchmarks: I've seen many models optimized for them that rank high on those leaderboards but don't work well in real agentic tasks.
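
For anyone who wants to sanity-check the "1/3 of the cost" claim, here is a minimal sketch of the arithmetic. It uses only the two approximate per-run figures quoted above; the function and variable names are my own, and it is not part of the benchmark harness.

```python
def cost_ratio(cost_a: float, cost_b: float) -> float:
    """Return cost_a expressed as a fraction of cost_b."""
    if cost_b <= 0:
        raise ValueError("cost_b must be positive")
    return cost_a / cost_b


# Approximate per-run costs quoted in the post, in USD (rounded figures).
glm_cost_per_run = 0.4
opus_cost_per_run = 1.2

ratio = cost_ratio(glm_cost_per_run, opus_cost_per_run)
print(f"GLM 5.1 costs about {ratio:.0%} of Opus 4.6 per run")

# Same comparison seen the other way: how many runs one dollar buys.
print(f"Runs per dollar: GLM 5.1 {1 / glm_cost_per_run:.1f}, "
      f"Opus 4.6 {1 / opus_cost_per_run:.2f}")
```

Since both inputs are rounded, treat the ratio as a ballpark rather than an exact number.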