Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

arXiv:2605.18177v1 Announce Type: new Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We