How Reliable is Language Model Micro-Benchmarking?

arXiv:2510.08730v2 Announce Type: replace-cross Micro-benchmarking offers a solution to the often prohibitive time and cost of language model development: evaluate on a very small subset of existing benchmarks. Can these micro-benchmarks, however, rank models as consistently as the full benchmarks they replace? And can they rank models more consistently than selecting a random subset of data points? In many scenarios, we find that the answer is no. We