OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

arXiv:2605.18758v1 Announce Type: cross Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we