PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

arXiv:2603.29281v1 Announce Type: cross A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments.