AI Writes Your Tests. Here's What It Systematically Misses.

Dev.to AI

We ran a tool called Optinum against 16 real bugs from SWE-bench Verified, a dataset of production OSS issues with human-verified patches. In 10 of the 16 cases (62.5%), the AI-written tests that accompanied each fix missed the exact failure class the bug belonged to. These were not random misses: the same categories came up over and over. We also took one instance, synthesized a test, and proved it in Docker: the test fails on the bug commit and passes on the fix commit. No spreadsheets, no hand-waving.
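The fail-on-bug, pass-on-fix check is simple enough to sketch. What follows is a minimal illustration, not Optinum's actual implementation: the names `Runner`, `docker_runner`, and `fails_then_passes` are hypothetical, and it assumes the project uses pytest and that the Docker image already has Python, pytest, and the project's dependencies installed.

```python
import subprocess
from typing import Callable

# Hypothetical type: a runner takes a commit SHA and returns the test's exit code.
Runner = Callable[[str], int]

PYTEST_OK = 0            # pytest exit code: all selected tests passed
PYTEST_TESTS_FAILED = 1  # pytest exit code: tests ran and at least one failed


def docker_runner(repo_dir: str, image: str, test_id: str) -> Runner:
    """Build a runner that checks out a commit and runs one test in Docker.

    Assumptions: repo_dir is an absolute path to a git checkout, and the
    image can run `python -m pytest` against the mounted source tree.
    """
    def run(commit: str) -> int:
        # Move the working tree to the commit under test.
        subprocess.run(
            ["git", "-C", repo_dir, "checkout", "--quiet", commit],
            check=True,
        )
        # Mount the repo into a throwaway container and run a single test.
        result = subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{repo_dir}:/work", "-w", "/work",
             image, "python", "-m", "pytest", "-q", test_id],
        )
        return result.returncode
    return run


def fails_then_passes(run: Runner, bug_commit: str, fix_commit: str) -> bool:
    """True only if the test genuinely fails on the bug and passes on the fix."""
    # Require exit code 1 on the bug commit, so a collection error or
    # "no tests collected" is not mistaken for a reproduced bug.
    if run(bug_commit) != PYTEST_TESTS_FAILED:
        return False
    return run(fix_commit) == PYTEST_OK
```

Because the runner is injected, the check itself can be exercised without Docker, for example with a fake such as `fails_then_passes({"bug": 1, "fix": 0}.__getitem__, "bug", "fix")`. Requiring exit code 1 rather than any non-zero code is a deliberate choice in this sketch: a test that never ran proves nothing about the bug.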