AI benchmarks don't mean what you think they mean (Sponsor)

Benchmarks are trotted out whenever a new model is released, but what do they actually measure? ngrok's Sam Rose dug into the papers, code, and critiques to find out what 14 popular AI benchmarks really test. E.g., SWE-bench Verified = fixing small bugs in 12 popular open source Python repositories. See what each benchmark measures.