Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

arXiv:2510.08480v2 Announce Type: replace Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition.