Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

arXiv:2603.17541v1

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs.