VideoFlexTok: Flexible-Length Coarse-to-Fine Video Tokenization

ArXi:2604.12887v1 Announce Type: cross Visual tokenizers map high-dimensional raw pixels into a compressed representation for downstream modeling. Beyond compression, tokenizers dictate what information is preserved and how it is organized. A de facto standard approach to video tokenization is to represent a video as a spatiotemporal 3D grid of tokens, each capturing the corresponding local information in the original signal.