STORM: End-to-End Referring Multi-Object Tracking in Videos

arXiv:2604.10527v1 Announce Type: cross Referring multi-object tracking (RMOT) is the task of associating all objects in a video that semantically match given textual queries or referring expressions. Existing RMOT approaches decompose object grounding and tracking into separate modules and exhibit limited performance due to the scarcity of