Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility

arXiv:2605.06856v1 Announce Type: new Generative AI systems achieve impressive performance on standard benchmarks yet fail to deliver real-world utility, a disconnect we identify across 28 deployment cases spanning education, healthcare, software engineering, and law. We argue that this benchmark-utility gap arises from three recurring failures in evaluation practice: proxy displacement, temporal collapse, and distributional concealment.