ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters

arXiv:2605.05331v1 Announce Type: cross Vision Transformer (ViT) autoencoders have emerged as compelling tokenizers for images, offering improved reconstruction over convolutional tokenizers. However, existing ViT tokenizers cannot explore this landscape as performance degrades outside