BenchBench: Benchmarking Automated Benchmark Generation

arXiv:2603.20807v1 Announce Type: new Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items often relies on LLM judges,