Tenacious-Bench v0.1: What Happens When You Build a Benchmark for Your Own Agent's Failures

Most benchmark papers start with a general problem. This one starts with a specific embarrassment. The Tenacious AI sales agent (an LLM-powered system that writes B2B outreach emails for a talent platform) achieves a 38.7% pass rate on the τ²-Bench retail evaluation slice. That means fewer than two in five of the emails it produces fully satisfy the quality rubric. In the rest, something is wrong: the agent over-commits on bench capacity, uses an assertive tone when the prospect is clearly still researching, or references prospect signals so generically that they might as well not be there.