Environmental Understanding Vision-Language Model for Embodied Agent

arXiv:2604.19839v1 Announce Type: new Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution.