Video-Zero: Self-Evolution Video Understanding

arXiv:2605.14733v1 Announce Type: new Self-evolution offers a promising path for improving reasoning models without relying on intensive human annotation. However, extending this paradigm to video understanding remains underexplored and challenging: videos are long, dynamic, and redundant, while the evidence needed for reasoning is often sparse and temporally localized. Naively generating difficult question-answer pairs from full videos can therefore produce supervision that appears challenging but is weakly grounded, relying on static cues or language priors rather than temporal evidence.