AI RESEARCH
SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts
arXiv CS.CV
arXiv:2603.22732v1 Announce Type: new Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-