Reinforcing Consistency in Video MLLMs with Structured Rewards

ArXi:2604.01460v1 Announce Type: new Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer.