What benchmarks actually matter when comparing LLMs?
r/LocalLLaMA
I’ve been digging into LLM benchmarks lately, and I’m a bit overwhelmed by how many there are and how inconsistent they feel. You’ve got MMLU (general knowledge), GSM8K (math/reasoning), HumanEval (coding), HELM and BIG-bench variants, and the list goes on. They all measure different things, and some seem easier to game or overfit than others.

I’m currently building a small open-source project that aggregates benchmark results into a unified view (kind of like a “Metacritic” for LLMs), but I’m not convinced I’m choosing the right signals.
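To make the aggregation idea concrete, here is a minimal sketch of one possible approach, not the project's actual method: min-max normalize each benchmark across models so the scales are comparable, then take a weighted average. The model names, the scores, and the `min_max_normalize` / `aggregate` helpers are all hypothetical placeholders for illustration; the numbers are not real results.

```python
# Hypothetical placeholder scores (percent correct). NOT real benchmark results.
scores = {
    "model_a": {"MMLU": 70.0, "GSM8K": 50.0, "HumanEval": 30.0},
    "model_b": {"MMLU": 60.0, "GSM8K": 80.0, "HumanEval": 40.0},
    "model_c": {"MMLU": 80.0, "GSM8K": 20.0, "HumanEval": 50.0},
}


def min_max_normalize(scores):
    """Rescale each benchmark to [0, 1] across the models that report it."""
    benchmarks = sorted({b for per_model in scores.values() for b in per_model})
    normalized = {model: {} for model in scores}
    for bench in benchmarks:
        values = [s[bench] for s in scores.values() if bench in s]
        lo, hi = min(values), max(values)
        for model, per_model in scores.items():
            if bench not in per_model:
                continue  # missing results are skipped, not treated as zero
            if hi == lo:
                # All models tie on this benchmark: no signal, use the midpoint.
                normalized[model][bench] = 0.5
            else:
                normalized[model][bench] = (per_model[bench] - lo) / (hi - lo)
    return normalized


def aggregate(scores, weights=None):
    """Weighted mean of normalized scores; unlisted benchmarks get weight 1.0."""
    weights = weights or {}
    result = {}
    for model, per_bench in min_max_normalize(scores).items():
        w = {bench: weights.get(bench, 1.0) for bench in per_bench}
        total = sum(w.values())
        if total == 0:
            continue  # nothing to average for this model
        result[model] = sum(per_bench[b] * w[b] for b in per_bench) / total
    return result


equal = aggregate(scores)
coding_heavy = aggregate(scores, weights={"HumanEval": 3.0})
for model in sorted(equal, key=equal.get, reverse=True):
    print(f"{model}: equal={equal[model]:.3f} coding_heavy={coding_heavy[model]:.3f}")
```

One caveat with this design: min-max scaling makes every score relative to the set of models included, so adding or removing a model changes everyone's number, and the choice of weights is itself a judgment about which signals matter.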