X2SAM: Any Segmentation in Images and Videos

ArXi:2605.00891v1 Announce Type: new Multimodal Large Language Models (MLLMs) have nstrated strong image-level visual understanding and reasoning, yet their pixel-level perception across both images and videos remains limited. Foundation segmentation models such as the SAM series produce high-quality masks, but they rely on low-level visual prompts and cannot natively interpret complex conversational instructions. Existing segmentation MLLMs narrow this gap, but are usually specialized for either images or videos and rarely both textual and visual prompts in one interface. We.