DINO-Tok: Adapting DINO for Visual Tokenizers

ArXi:2511.20565v2 Announce Type: replace Recent advances in visual generation have emphasized the importance of Latent Generative Models (LGMs), which critically depend on effective visual tokenizers to bridge pixels and semantic representations. However, tokenizers constructed on pre-trained vision foundation models (VFMs) often struggle to balance semantic richness and reconstruction fidelity in high-dimensional latent spaces. In this paper, we