FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding

arXiv:2605.19846v1 Announce Type: cross Vision-Language Models (VLMs) have demonstrated remarkable capabilities in general video understanding, yet they often struggle with the fine-grained comprehension crucial for real-world applications requiring nuanced interpretation of human actions and interactions. While some recent human-centric benchmarks evaluate aspects of model behaviour such as fairness/ethics, emotion perception, and broader human-centric metrics, they do not combine long-form videos, very dense QA coverage, and frame-level spatial/temporal grounding at scale. To bridge this gap, we.