We're actually running out of benchmarks to upper bound AI capabilities

Written quickly as part of the Inkhaven Residency. Opinions are my own and do not represent METR’s official opinion.

In early 2025, upper-bounding model capabilities using fixed benchmarks was already somewhat challenging. Benchmarks were being saturated at an ever-increasing rate: benchmarks such as GPQA, which were incredibly challenging for AI in early 2024, were being saturated scarcely a year later.