Multimodal Causal-Driven Representation Learning for Generalizable Medical Image Segmentation

arXiv:2508.05008v2 Announce Type: replace Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot capabilities in various computer vision tasks. However, their application to medical imaging remains challenging due to the high variability and complexity of medical data. Specifically, medical images often exhibit significant domain shifts caused by various confounders, including equipment differences, procedure artifacts, and imaging modes, which can lead to poor generalization when models are applied to unseen domains.