SynPO: Synergizing Descriptiveness and Preference Optimization for Video Detailed Captioning

arXiv:2506.00835v2 Announce Type: replace Fine-grained video captioning aims to generate detailed, temporally coherent descriptions of video content. However, existing methods struggle to capture subtle video dynamics and rich detail. In this paper, we leverage preference learning to enhance the performance of vision-language models in fine-grained video captioning, while mitigating several limitations inherent to direct preference optimization (DPO