Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

arXiv:2605.13950v1

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we