EgoVITA: Learning to Plan and Verify for Egocentric Video Reasoning

arXiv:2511.18242v2 Announce Type: replace

Egocentric video understanding requires procedural reasoning under partial observability and continuously shifting viewpoints. Current multimodal large language models (MLLMs) struggle with this setting, often generating plausible but visually inconsistent or weakly grounded responses. We