From Drop-off to Recovery: A Mechanistic Analysis of Segmentation in MLLMs

arXiv:2603.17228v1 Announce Type: cross Multimodal Large Language Models (MLLMs) are increasingly applied to pixel-level vision tasks, yet their intrinsic capacity for spatial understanding remains poorly understood. We investigate segmentation capacity through a layerwise linear probing evaluation across the entire MLLM pipeline: vision encoder, adapter, and