GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

arXiv:2604.14262v1 Announce Type: cross GUI grounding models report over 85% accuracy on standard benchmarks, yet drop 27 to 56 percentage points when instructions require spatial reasoning rather than direct element naming. Current benchmarks miss this because they evaluate each screenshot once with a single fixed instruction. We