Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

arXiv:2603.24257v1 Announce Type: new Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, which hinders the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolve these inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, and they have limited capacity to reason over previously observed objects. In this paper, we.