TennisTV: Do Multimodal Large Language Models Understand Tennis Rallies?
arXiv CS.CV
arXiv:2509.15602v5. Multimodal large language models (MLLMs) excel at general video understanding but struggle with fast, high-frequency sports like tennis, where rally clips are short yet information-dense. To systematically evaluate MLLMs in this challenging domain, we present TennisTV, the first and most comprehensive benchmark for tennis video understanding. TennisTV models each rally as a temporally ordered sequence of consecutive stroke events and uses automated pipelines for filtering and question generation.
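To make the rally representation concrete, the following is a minimal sketch of a rally stored as a temporally ordered sequence of stroke events. The class names, field names, and label values (`StrokeEvent`, `Rally`, `timestamp_s`, `player`, `stroke_type`) are assumptions made for illustration; the abstract does not specify the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class StrokeEvent:
    # All fields are hypothetical; the abstract does not define a schema.
    timestamp_s: float  # time of ball contact within the clip, in seconds
    player: str         # illustrative label, e.g. "near" or "far"
    stroke_type: str    # illustrative label, e.g. "serve", "forehand"


@dataclass
class Rally:
    # Stroke events kept in the order they occur in the clip.
    strokes: List[StrokeEvent] = field(default_factory=list)

    def add(self, event: StrokeEvent) -> None:
        # Reject events that would break temporal ordering.
        if self.strokes and event.timestamp_s < self.strokes[-1].timestamp_s:
            raise ValueError("stroke events must be added in temporal order")
        self.strokes.append(event)

    def shot_count(self) -> int:
        # Number of strokes in the rally.
        return len(self.strokes)

    def stroke_sequence(self) -> List[str]:
        # Stroke types in the order they were played.
        return [s.stroke_type for s in self.strokes]


# Example rally with three consecutive strokes (made-up values).
rally = Rally()
rally.add(StrokeEvent(0.0, "near", "serve"))
rally.add(StrokeEvent(1.1, "far", "backhand"))
rally.add(StrokeEvent(2.3, "near", "forehand"))
```

With a structure along these lines, questions about shot count or stroke order could be derived directly from the event sequence, which is one plausible reason an event-based representation suits automated question generation.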