Frameworks For Supporting LLM/Agentic Benchmarking [P]

I think the way we are approaching benchmarking is a bit problematic. From what I've read about how frontier labs benchmark their models, they essentially create a new model, configure a harness, and then run a massive benchmarking suite just to demonstrate marginal gains. I have several problems with this approach. I worry that we are wasting a significant amount of resources iterating on models and effectively trading carbon for confidence. Looking at the latest Gemini benchmarking, for instance, they applied 30,000 prompts.