Local models for coding have reached the threshold where they are feasible for real work

r/LocalLLaMA
Open Source AI

We ran open-weight 27B-32B models on Terminal-Bench 2.0 (89 tasks, terminal-bench-2.git @ 69671fb) through our agent harness. The best result was Qwen 3.6-27B at 38.2% (34/89) under the default per-task timeout, the same constraint the public leaderboard uses (Qwen's official post uses a relaxed config). We deliberately kept the default setup of the official TB leaderboard because we wanted an apples-to-apples number against the verified leaderboard. One interesting finding is that MoE models still have an order-of-magnitude improvement in inference speed.
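For anyone who wants to sanity-check the headline number, here is a minimal sketch of the arithmetic behind 34/89. The `pass_rate` helper is a hypothetical name used for illustration; it is not part of Terminal-Bench or our harness.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the share of tasks passed as a percentage, rounded to one decimal.

    Hypothetical helper for illustration only; not a Terminal-Bench API.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return round(100 * passed / total, 1)


# Figures from the post: 34 of 89 tasks solved under the default timeout.
print(f"{pass_rate(34, 89)}%")  # prints "38.2%"
```

With 89 tasks, each solved task moves the score by about 1.1 points, so small gaps between models on this benchmark amount to only a task or two.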