Towards Temporal Compositional Reasoning in Long-Form Sports Videos

arXiv:2604.22226v1 Announce Type: new

Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult: answering a question requires both locating temporally sparse evidence and integrating that evidence into the reasoning process.