A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

arXiv:2604.23114v1

In limited-data settings, a single endpoint mean of an evaluation metric such as the Continuous Ranked Probability Score (CRPS) is itself a random variable, yet it is routinely reported as if it were a stable property of the method. We study when this practice fails. Using 50 independent repetitions across six regression datasets, we show that CRPS variance trajectories differ substantially across methods and are not always well described by a smooth power-law decay.
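
To make the claim that an endpoint mean CRPS is a random variable concrete, the following is a minimal sketch on synthetic data, not the paper's experimental setup. It assumes a toy Gaussian predictive model fitted to 20 training points and uses the standard closed-form CRPS for a Gaussian forecast. The function names, the data-generating process, and the sample sizes are all hypothetical choices made for illustration; only the count of 50 repetitions mirrors the abstract.

```python
import math
import random
import statistics


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def endpoint_mean_crps(seed, n_train=20, n_test=200):
    """One repetition: fit a toy Gaussian model on a small sample and
    return its mean CRPS on a fresh test sample."""
    rng = random.Random(seed)
    # Hypothetical data-generating process (an assumption): y ~ N(1.0, 0.5^2).
    train = [rng.gauss(1.0, 0.5) for _ in range(n_train)]
    test = [rng.gauss(1.0, 0.5) for _ in range(n_test)]
    mu_hat = statistics.fmean(train)
    sigma_hat = statistics.stdev(train)
    return statistics.fmean(crps_gaussian(mu_hat, sigma_hat, y) for y in test)


# 50 independent repetitions, each with its own seed.
scores = [endpoint_mean_crps(seed) for seed in range(50)]

print(f"single seed (seed=0): {scores[0]:.4f}")
print(f"mean over 50 seeds:   {statistics.fmean(scores):.4f}")
print(f"std over 50 seeds:    {statistics.stdev(scores):.4f}")
```

Comparing the single-seed value with the spread across seeds shows how much a reported number could move under a different random draw. In this sketch, repeating the loop for several values of `n_train` would trace a variance trajectory of the kind the abstract describes.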