A protocol for auditing AI agent harnesses

I have been building coding agents for the last several months and watching every component I added fail to move the resolve-rate I cared about. A verifier first. Multi-candidate sampling next. A structured-output sub-agent after that. Each was justified by a specific observed failure mode and each looked cheap at the margin. None of them helped.