SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track

arXiv:2603.27241v1 Announce Type: new Referring video object segmentation (RVOS) commonly grounds targets in videos based on static textual cues. The MeViS benchmark extends this by incorporating motion-centric expressions (referring and reasoning motion expressions) and