IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST

The "Black Box" Problem of Agent Benchmarks
The Experiment: Diagnosing ITBench Agents
Finding 1: Stronger models like Gemini-3-Flash show surgical (isolated) failure modes per trace, whereas open-source Kimi-K2 and GPT-OSS-120B show compounding failure patterns
Finding 2: "Non-Fatal" vs. "Fatal" Failures
The "Non-Fatal" (Benign) Flaws
The "Fatal" Flaws
Gemini-3-Flash (Decisive but Overconfident)
GPT-OSS-120B
A different (and useful) way to read the plots: “fatal” vs