Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

arXiv:2605.13737v1 When an omnimodal large language model accepts a question whose textual premise contradicts what it actually sees or hears, does the failure lie in perception or in action? Recent omnimodal models are positioned as perception-grounded agents that jointly process video, audio, and text, yet a basic form of grounding remains untested: catching a textual claim that conflicts with the model's own sensory input. We