DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

arXiv:2605.17439v2 Announce Type: replace-cross Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachability property over a latent UI state-transition graph, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between an evaluator-side execution error and a genuine software defect.
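
The gap between graph-level reachability and a single observed trajectory can be illustrated with a toy example. This is a minimal sketch and not the paper's method: the graph `UI_GRAPH`, its state and action names, and the helpers `reachable` and `rollout` are all hypothetical, and in practice the graph is latent, so the evaluator cannot run the search shown here.

```python
from collections import deque

# Hypothetical UI state-transition graph: state -> {action: next_state}.
# In a real evaluation this graph is latent and never fully observed.
UI_GRAPH = {
    "home": {"open_menu": "menu", "click_banner": "promo"},
    "menu": {"choose_settings": "settings", "close": "home"},
    "promo": {"dismiss": "home"},
    "settings": {"toggle_dark_mode": "dark_mode_on"},
    "dark_mode_on": {},
}


def reachable(graph, start, goal):
    """Graph-level check: can goal be reached from start by any action sequence?"""
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        for nxt in graph[state].values():
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def rollout(graph, start, actions):
    """Trajectory-level check: follow one fixed action sequence, return visited states."""
    state, visited = start, [start]
    for action in actions:
        if action not in graph[state]:
            break  # action unavailable in this state, so the rollout stalls
        state = graph[state][action]
        visited.append(state)
    return visited


# One rollout in which the agent wanders between the home and promo screens.
failed_path = rollout(UI_GRAPH, "home", ["click_banner", "dismiss", "click_banner"])
print(failed_path[-1] == "dark_mode_on")            # False: this path misses the goal
print(reachable(UI_GRAPH, "home", "dark_mode_on"))  # True: the software is correct
```

Here the rollout fails even though the goal state is reachable, so the failure belongs to the evaluator and not to the software. An observer who sees only `failed_path` cannot tell this case apart from one in which the `toggle_dark_mode` transition is actually missing.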