Understanding Temporal Logic Consistency in Video-Language Models through Cross-Modal Attention Discriminability

arXiv:2510.08138v2

Large language models (LLMs) often generate self-contradictory outputs, which severely impacts their reliability and hinders their adoption in practical applications. In video-language models (Video-LLMs), this phenomenon has recently drawn the attention of researchers. Specifically, these models fail to provide logically consistent responses to rephrased questions based on their grounding outputs. However, the underlying causes of this phenomenon remain underexplored.