Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction

arXiv:2604.00827v1 Announce Type: new Vision Transformers (ViTs) have demonstrated state-of-the-art performance on several benchmarks, yet their high computational cost hinders practical deployment. Patch pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored and limiting their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers.
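
The abstract does not say how VPP scores or selects patches, so the following is only a generic sketch of early token reduction guided by a temporal prior, not the authors' method. It assumes each patch of the current frame already has a relevance score carried over from the previous frame (for example, overlap with the previous frame's instance masks). The function name `prune_patches` and the parameters `prior_scores` and `keep_ratio` are hypothetical.

```python
def prune_patches(tokens, prior_scores, keep_ratio):
    """Keep the patches with the highest prior scores, preserving spatial order.

    tokens       -- list of per-patch embeddings (any objects), one per patch
    prior_scores -- list of floats, assumed to come from the previous frame
    keep_ratio   -- fraction of patches to keep, in (0, 1]
    """
    if len(tokens) != len(prior_scores):
        raise ValueError("tokens and prior_scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Always keep at least one patch so later layers have some input.
    num_keep = max(1, int(len(tokens) * keep_ratio))

    # Rank patch indices by prior score, highest first (ties broken by index).
    ranked = sorted(range(len(tokens)), key=lambda i: (-prior_scores[i], i))

    # Restore the original patch order for the surviving tokens.
    kept_indices = sorted(ranked[:num_keep])
    kept_tokens = [tokens[i] for i in kept_indices]
    return kept_tokens, kept_indices


# Six patches; the assumed temporal prior favours patches 1, 2 and 4.
tokens = ["p0", "p1", "p2", "p3", "p4", "p5"]
prior_scores = [0.1, 0.9, 0.8, 0.05, 0.7, 0.2]
kept_tokens, kept_indices = prune_patches(tokens, prior_scores, keep_ratio=0.5)
print(kept_indices)  # [1, 2, 4]
```

Because the scores here come from the previous frame rather than from the current frame's deeper features, the selection can run before the first transformer block, which is the sense in which a temporal prior could make early-layer sparsity possible.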