Decoupling Perception from Reasoning for Hallucination-Resistant Video Understanding

ArXi:2511.18463v3 Announce Type: replace Video Large Language Models improve reasoning over complex videos by generating intermediate reasoning text. However, reliable reasoning depends on accurate video perception. In existing approaches, perception evidence is intertwined with reasoning text, making it difficult to directly supervise the perception process. We argue that reliable supervision requires explicitly separating perception evidence from reasoning so that perception can be verified independently.