5 silent failure modes in production AI agents (and how we instrument for them)

AI agents fail differently than apps. The failure rarely lives in the work itself. It lives in the seams: the delivery step, the tool call, the inbound routing decision, the bootstrap that ate the budget. None of those surface as exceptions, so APM dashboards say "green" while the user sees nothing. Here are five failure modes that show up that way, and how we instrument or defend against each.

1. Crons that "succeed" but never deliver

The cron framework doesn't know what the user received; it knows what the agent reported per-run.
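
One way to make that gap visible is to record the delivery outcome separately from the agent's self-reported status, and flag runs where the two disagree. The sketch below is illustrative only: `CronRun`, `agent_status`, `delivery_receipt`, and the status strings are hypothetical names, not part of any particular cron framework or of the setup described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CronRun:
    # All field names are assumptions made for this example.
    job_id: str
    agent_status: str                       # what the agent reported, e.g. "ok" or "error"
    delivery_receipt: Optional[str] = None  # ID returned by the delivery channel, if any


def classify(run: CronRun) -> str:
    """Keep 'the agent said ok' distinct from 'the user received something'."""
    if run.agent_status != "ok":
        return "failed"          # a visible failure; ordinary alerting already covers it
    if not run.delivery_receipt:
        return "silent_failure"  # reported success, but no evidence of delivery
    return "delivered"


def silent_failures(runs: List[CronRun]) -> List[str]:
    """Return the job IDs that reported success without a delivery receipt."""
    return [r.job_id for r in runs if classify(r) == "silent_failure"]


if __name__ == "__main__":
    runs = [
        CronRun("daily-digest", "ok", "msg-1"),
        CronRun("weekly-report", "ok"),      # agent said ok, nothing was delivered
        CronRun("nightly-sync", "error"),
    ]
    print(silent_failures(runs))
```

The design choice is that a run counts as delivered only when the delivery channel itself returned evidence, so "ok" from the agent is never sufficient on its own.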