VVitCutLER: Towards Unsupervised Object Detection and Segmentation in Videos

ArXi:2605.17584v1 Announce Type: new Unsupervised pixel-level video understanding remains challenging in real-world scenarios, where motion blur, occlusion, and fast object dynamics often cause temporal drift and flickering pseudo-labels. We propose VVitCutLER, an unsupervised framework for video object detection and instance segmentation, which improves the quality of pseudo-labels through temporal consistency. Our core contribution is VitCut, a temporarily stable pseudo-label generator that reduces error accumulation during field degradation through cross-frame region consistency.