Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier

arXiv:2604.17573v1

We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for assessing deployed, agentic systems: distributional invalidity (evaluation inputs do not reflect real interaction distributions), temporal invalidity (evaluations are post-hoc rather than