Channel Attention-Guided Cross-Modal Knowledge Distillation for Referring Image Segmentation

arXiv:2604.16806v1

Referring image segmentation (RIS) is a cross-modal task that integrates vision and language: given a language description, the model must accurately segment the target region in an image. Existing RIS methods typically rely on large-scale vision and language encoders to improve performance, but their large parameter counts severely restrict deployment in settings with limited computing resources.