CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

arXiv:2605.08735v1 Announce Type: new Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound.