LLM Benchmarks Are Junk Science
Towards AI • Generative AI
An Oxford review of 445 benchmarks found 84% lack basic statistical testing. Models score 90% on standard tests but 2% on unseen problems…