Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

arXiv:2510.26384v2 Announce Type: replace-cross The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models.