SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation

arXiv:2604.02252v1 Announce Type: new Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-