Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

arXiv:2510.04265v3

Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences.
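
To make the idea of a posterior estimate with a credible interval concrete, here is a minimal sketch. It assumes binary (pass/fail) outcomes and a uniform Beta(1, 1) prior, which may differ from the prior and outcome model the paper actually uses. The function names `beta_posterior_summary` and `prob_greater` are hypothetical, and the interval and the comparison are approximated by Monte Carlo sampling rather than computed exactly.

```python
import random


def beta_posterior_summary(successes, trials, prior_a=1.0, prior_b=1.0,
                           level=0.95, draws=20000, seed=0):
    """Posterior mean and equal-tailed credible interval for a success rate.

    Assumption: Beta(prior_a, prior_b) prior with a binomial likelihood,
    so the posterior is Beta(prior_a + successes, prior_b + failures).
    """
    a = prior_a + successes
    b = prior_b + (trials - successes)
    mean = a / (a + b)

    # Approximate the interval from sorted posterior draws (stdlib only).
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    lo = samples[int(tail * draws)]
    hi = samples[min(draws - 1, int((1.0 - tail) * draws))]
    return mean, lo, hi


def prob_greater(succ_a, trials_a, succ_b, trials_b,
                 prior_a=1.0, prior_b=1.0, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate of model A > rate of model B)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        pa = rng.betavariate(prior_a + succ_a, prior_b + trials_a - succ_a)
        pb = rng.betavariate(prior_a + succ_b, prior_b + trials_b - succ_b)
        wins += pa > pb
    return wins / draws


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    mean, lo, hi = beta_posterior_summary(successes=7, trials=10)
    print(f"posterior mean {mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
    print(f"P(A > B) = {prob_greater(7, 10, 5, 10):.3f}")
```

One possible decision rule under these assumptions is to declare model A better than model B only when `prob_greater` exceeds a threshold chosen in advance, and to report the comparison as inconclusive otherwise.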