Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

arXiv:2603.19026v1 Announce Type: new Recent segmentation methods that leverage Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist segmentation.