VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

arXiv:2605.04870v1 Announce Type: new Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over text that appears visually in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited.