Why ML accuracy numbers are unfalsifiable, and what a 1287-line Python tool does about it

A few weeks ago I was reading a model card for an open-weight code model. It claimed pass = 67% on HumanEval. I tried to reproduce it. I got 54%. I went back to the model card. The metric was named, the dataset was named, the model checkpoint hash was published. Everything looked reproducible.