GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

arXiv:2511.00810v3 Announce Type: replace-cross

Graphical user interface (GUI) grounding is a key capability for computer-use agents: it maps natural-language instructions to actionable regions on the screen. Existing Multimodal Large Language Model (MLLM) approaches typically formulate GUI grounding as a text-based coordinate generation task. However, directly generating precise coordinates from visual inputs is challenging and often data-intensive. An intuitive strategy is to first identify the instruction-relevant visual patches and then determine the exact click location within them.
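The two-stage strategy described above (select relevant patches, then localize the click within them) can be illustrated with a minimal sketch. This is not the paper's method: the function name `click_from_patch_scores`, the `keep_ratio` threshold, and the score-weighted centroid rule are all assumptions chosen for illustration, and the patch relevance scores are taken as given rather than derived from any model's attention.

```python
from typing import List, Tuple


def click_from_patch_scores(
    scores: List[List[float]],
    patch_size: int,
    keep_ratio: float = 0.5,
) -> Tuple[float, float]:
    """Map a grid of patch relevance scores to an (x, y) pixel location.

    Hypothetical helper: `scores[r][c]` is the relevance of the patch in
    row r, column c; every patch is a square of `patch_size` pixels.
    """
    peak = max(max(row) for row in scores)
    if peak <= 0:
        raise ValueError("no patch has positive relevance")

    # Stage 1: keep only patches whose score is close enough to the peak.
    threshold = keep_ratio * peak

    # Stage 2: score-weighted centroid of the centers of the kept patches.
    total = x_sum = y_sum = 0.0
    for r, row in enumerate(scores):
        for c, s in enumerate(row):
            if s >= threshold:
                x_sum += s * (c + 0.5) * patch_size
                y_sum += s * (r + 0.5) * patch_size
                total += s
    return x_sum / total, y_sum / total


if __name__ == "__main__":
    # Two equally relevant neighboring patches in the middle row.
    grid = [
        [0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
    ]
    print(click_from_patch_scores(grid, patch_size=10))
```

In this sketch the click lands on the shared edge between the two selected patches, which shows why a patch-level selection step alone is coarser than a pixel-level coordinate and needs a second localization step.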