PRISM: Perception Reasoning Interleaved for Sequential Decision Making

arXiv:2605.05407v1 Announce Type: new Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we