CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

arXiv:2604.14630v1 Announce Type: cross Recent advances in unsupervised video object segmentation have highlighted the potential of two-stream architectures that integrate appearance and motion cues. However, fully leveraging these complementary sources of information requires effectively modeling their interdependencies. In this paper, we