Ranking Reasoning LLMs under Test-Time Scaling

arXiv:2603.10960v1

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and