AI RESEARCH
ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
arXiv cs.AI
arXiv:2605.10106v1 Announce Type: cross Recent advances in Multi-modal Large Language Models (MLLMs) target 3D spatial intelligence, yet progress has been largely driven by post-