Systematic Capability Benchmarking of Frontier Large Language Models for Offensive Cyber Tasks

arXiv:2604.17159v1 Announce Type: cross

We present, to our knowledge, the most comprehensive cross-model evaluation of LLM agents on offensive cybersecurity tasks, benchmarking 10 frontier models from 7 providers on all 200 challenges of the NYU CTF Bench. We build on the D-CIPHER multi-agent framework, extending it with a multi-provider backend, a custom Kali Linux environment with over 100 pre-installed penetration testing tools, and runtime tool-discovery agents.