TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation

arXiv:2605.16740v1 Announce Type: new Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision-language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We