Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

arXiv:2604.01843v1 Announce Type: cross Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, so autoregressive or diffusion-based priors are required to model their dependencies at sampling time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data.
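
To make "position-dependent" concrete, the following is a minimal sketch of nearest-neighbor vector quantization on toy data. It is not the method proposed in this work. The names `quantize`, `codebook`, and `features`, the codebook size (8), the feature dimension (4), and the 4x4 grid are all assumptions chosen for illustration. The sketch shows that the sequence of codes follows the spatial arrangement of the features, while the unordered multiset of codes does not change when positions are shuffled.

```python
import random
from collections import Counter


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


rng = random.Random(0)

# Hypothetical toy setup: 8 codebook vectors of dimension 4, and a 4x4
# feature grid flattened into 16 positions (stand-ins for encoder outputs).
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
features = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)]

# One code index per spatial position.
codes = quantize(features, codebook)

# Shuffle the spatial positions and quantize again.
perm = list(range(len(features)))
rng.shuffle(perm)
shuffled_codes = quantize([features[i] for i in perm], codebook)

# The code sequence is reordered exactly as the positions were, so any
# model that reads the sequence depends on the spatial arrangement.
grid_follows_positions = shuffled_codes == [codes[i] for i in perm]

# The multiset of codes ignores order and is therefore unchanged.
bag_is_invariant = Counter(shuffled_codes) == Counter(codes)

print(grid_follows_positions, bag_is_invariant)
```

Comparing `Counter` objects is only one simple way to obtain an order-free view of the codes. Whether such a view retains enough information for spatially aligned images is the question the abstract raises, and this sketch does not answer it.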