When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition

arXiv:2603.16256v1 Announce Type: new Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant potential in complex visual tasks through the integration of Chain-of-Thought (CoT) reasoning. However, in Video Question Answering, extended thinking processes do not consistently yield performance gains and may even lead to degradation due to "visual anchor drifting", where models increasingly rely on self-generated text, sidelining visual inputs and causing hallucinations. While existing mitigations typically