VideoTIR: Accurate Understanding for Long Videos with Efficient Tool-Integrated Reasoning

arXiv:2603.25021v1 Announce Type: new Existing Multimodal Large Language Models (MLLMs) often suffer from hallucinations in long video understanding (LVU), primarily due to the imbalance between textual and visual tokens. Observing that MLLMs handle short visual inputs well, recent LVU works alleviate hallucinations by automatically parsing the vast visual data into manageable segments that can be effectively processed by MLLMs.