Grounding Video Reasoning in Physical Signals

arXiv:2604.21873v1 Announce Type: new Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We