EVMbench Deep Dive: Can AI Agents Actually Find Smart Contract Bugs Better Than Human Auditors? We Tested the Claims

TL;DR: OpenAI and Paradigm's EVMbench benchmark claims that GPT-5.3-Codex can autonomously exploit 71% of smart contract vulnerabilities. BlockSec's re-evaluation in March 2026 challenged those numbers, finding that scaffold design inflated the exploit scores. Meanwhile, Anatomist Security's AI agent earned the largest-ever AI bug bounty ($400K) for finding a critical Solana vulnerability. This article breaks down what EVMbench actually measures, where AI auditing genuinely works today, where it fails catastrophically, and the practical hybrid workflow that outperforms both humans and AI working alone.