Spatiotemporal Sycophancy: Negation-Based Gaslighting in Video Large Language Models

arXiv:2604.17873v1 Announce Type: new Video Large Language Models (Vid-LLMs) have demonstrated remarkable performance in video understanding tasks, yet their robustness under conversational interaction remains largely underexplored. In this paper, we identify spatiotemporal sycophancy, a failure mode in which Vid-LLMs retract initially correct, visually grounded judgments and conform to misleading user feedback under negation-based gaslighting. Rather than merely changing their answers, the models often fabricate unsupported temporal or spatial explanations to justify incorrect revisions.