Stay in your Lane: Role Specific Queries with Overlap Suppression Loss for Dense Video Captioning

arXiv:2603.11439v1

Dense Video Captioning (DVC) is a challenging multimodal task that involves temporally localizing multiple events within a video and describing each of them in natural language. While query-based frameworks enable simultaneous, end-to-end localization and captioning, their reliance on shared queries often leads to significant interference between the two tasks, as well as temporal redundancy in localization.