Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

arXiv:2602.05523v2 Announce Type: replace-cross Agentic large language models (LLMs) are increasingly evaluated on cybersecurity tasks using capture-the-flag (CTF) benchmarks, yet existing pointwise benchmarks offer limited insight into agent robustness and generalisation across alternative versions of the source code. We