
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

arXiv cs.CV

arXiv:2604.22836v1. This report describes a referring video object segmentation (Ref-VOS) pipeline centered on Sa2VA and organized around explicit agent roles. The key idea is that Sa2VA should provide the first dense semantic hypothesis, while an agent loop decides whether that hypothesis should be accepted, revised, or refined. The pipeline starts with a target-presence judgment stage: if the referred object does not appear in the video, the system directly outputs all-zero masks. Otherwise, Sa2VA receives the video and the referring prompt and produces a coarse mask trajectory over the full video.
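The control flow described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: every name here (`run_pipeline`, `target_present`, `coarse_segment`, `review`, `revise`, `max_rounds`) is hypothetical, the models are passed in as plain callables rather than real Sa2VA calls, and the bounded review loop is an assumption, since the abstract does not say how the agent loop terminates.

```python
from typing import Callable, List, Sequence

Frame = Sequence[Sequence[int]]   # one grayscale frame, H x W (assumed layout)
Mask = List[List[int]]            # one binary mask, H x W


def zero_masks(num_frames: int, height: int, width: int) -> List[Mask]:
    """Return one all-zero mask per frame."""
    return [[[0] * width for _ in range(height)] for _ in range(num_frames)]


def run_pipeline(
    frames: Sequence[Frame],
    prompt: str,
    target_present: Callable[[Sequence[Frame], str], bool],
    coarse_segment: Callable[[Sequence[Frame], str], List[Mask]],
    review: Callable[[Sequence[Frame], str, List[Mask]], str],
    revise: Callable[[Sequence[Frame], str, List[Mask], str], List[Mask]],
    max_rounds: int = 3,  # assumed cap; not stated in the source
) -> List[Mask]:
    if not frames:
        return []
    height, width = len(frames[0]), len(frames[0][0])

    # Stage 1: target-presence judgment. No target means all-zero masks.
    if not target_present(frames, prompt):
        return zero_masks(len(frames), height, width)

    # Stage 2: coarse mask trajectory (the role Sa2VA plays in the report).
    masks = coarse_segment(frames, prompt)

    # Stage 3: agent loop. The reviewer returns "accept", "revise", or "refine".
    for _ in range(max_rounds):
        verdict = review(frames, prompt, masks)
        if verdict == "accept":
            break
        masks = revise(frames, prompt, masks, verdict)
    return masks
```

Passing the stages in as callables keeps the sketch independent of any particular model API; in a real system each callable would wrap the corresponding model or agent.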