NarrativeTrack: Evaluating Entity-Centric Reasoning for Narrative Understanding

arXiv:2601.01095v2 Announce Type: replace-cross Multimodal large language models (MLLMs) have achieved impressive progress in vision-language reasoning, yet their ability to understand temporally unfolding narratives in videos remains underexplored. True narrative understanding requires grounding who is doing what, when, and where, while maintaining coherent entity representations across dynamic visual and temporal contexts. We