From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

arXiv:2605.10834v1

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which agents will perform best against real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings.