LLM Benchmarks Are Junk Science

Towards AI
Generative AI

An Oxford review of 445 benchmarks found that 84% lack basic statistical testing. Models score 90% on standard tests but only 2% on unseen problems…