EgoIntent: An Egocentric Step-level Benchmark for Understanding What, Why, and Next

arXiv:2603.12147v1

Multimodal Large Language Models (MLLMs) have demonstrated remarkable video reasoning capabilities across diverse tasks. However, their ability to understand human intent at a fine-grained level in egocentric videos remains largely unexplored. Existing benchmarks focus primarily on episode-level intent reasoning, overlooking the finer granularity of step-level intent understanding.