Why it's getting harder to measure AI performance (9 minute read)

METR's data suggests that AI progress is advancing at an exponential rate. Some models score above the previous trend line, which points to even faster progress. However, regardless of a model's ability, task lengths still vary significantly, which makes METR's measurements difficult to use for comparing progress. Newer models appear to be better than their predecessors, but it is hard to say by how much.