Terminal Bench score for Mistral 3.5 Medium

r/LocalLLaMA

So, there were a couple of promising benchmark scores reported by mistralai in the model card for Mistral 3.5 Medium, BUT the one I usually care about the most, TerminalBench 2.0, wasn't among them. Since I was really curious how the new Mistral handles agentic stuff, I decided to benchmark it myself. I didn't run TerminalBench 2.0, because I'm not crazy (usage would be biiiig), BUT I did run TBLite, which is a lighter/faster version of TerminalBench 2.0. The scores in this smaller variant don't correlate directly with TB2 scores, however.