Accelerating Vision Transformers with Adaptive Patch Sizes

arXiv:2510.18091v2 Announce Type: replace-cross Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of content, which produces long input sequences for high-resolution images. We present Adaptive Patch Transformers (APT), which address this by using multiple patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patches to homogeneous regions and smaller patches to complex ones. APT achieves a drastic speedup in ViT inference and.
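
To make the token-reduction idea concrete, here is a minimal sketch of content-dependent patch allocation. It is not the APT method itself: the abstract does not say how homogeneity is measured or how patch sizes are chosen, so the quadtree-style splitting, the pixel-variance criterion, and every name and default below (`adaptive_patches`, `block_variance`, `base`, `max_size`, `threshold`) are assumptions made for illustration only.

```python
def block_variance(image, r, c, size):
    """Population variance of the pixels in a size x size block at (r, c)."""
    values = [image[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def adaptive_patches(image, base=16, max_size=64, threshold=1e-3):
    """Cover a grayscale image (list of rows) with patches of varying size.

    Assumed criterion (not taken from the paper): a block stays one large
    patch when its pixel variance is below `threshold`; otherwise it is split
    into four quadrants, down to the smallest patch size `base`.
    Returns a list of (row, col, size) tuples, one per token.
    """
    height, width = len(image), len(image[0])
    if height % max_size or width % max_size:
        raise ValueError("image sides must be multiples of max_size")
    patches = []

    def split(r, c, size):
        # Stop at the smallest patch size or when the block is homogeneous.
        if size == base or block_variance(image, r, c, size) < threshold:
            patches.append((r, c, size))
            return
        half = size // 2
        for dr in (0, half):
            for dc in (0, half):
                split(r + dr, c + dc, half)

    for r in range(0, height, max_size):
        for c in range(0, width, max_size):
            split(r, c, max_size)
    return patches


def uniform_token_count(image, base=16):
    """Number of tokens a standard ViT would produce with base-size patches."""
    return (len(image) // base) * (len(image[0]) // base)


# Toy 128x128 image: a checkerboard (complex) in the top-left 64x64 corner,
# constant zeros (homogeneous) everywhere else.
image = [
    [float((i + j) % 2) if i < 64 and j < 64 else 0.0 for j in range(128)]
    for i in range(128)
]
patches = adaptive_patches(image)
print(len(patches), "adaptive tokens vs", uniform_token_count(image), "uniform tokens")
```

On this toy image the sketch keeps 16x16 patches only in the textured corner and covers each flat quadrant with a single 64x64 patch, giving 19 tokens instead of the 64 that uniform 16x16 patching would produce. A real model would also need to embed patches of different sizes into a common token dimension, which this sketch does not attempt.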