MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

arXiv:2501.02955v2

In recent years, vision language models (VLMs) have made significant advances in video understanding. However, one crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive benchmark designed to assess the fine-grained motion comprehension of video understanding models.