WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

arXiv:2605.16402v1

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy has largely been established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we