Minerva-Ego: Spatiotemporal Hints for Egocentric Video Understanding

arXiv:2605.15342v1 Announce Type: cross Video reasoning models are a core component of egocentric and embodied agents. However, standard benchmarks evaluate only a model's final output (e.g., the answer to a question), not its intermediate reasoning steps, and most provide answers only in the text domain. We