MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

arXiv:2605.18652v1 Announce Type: new Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we