Infinity and Beyond: Compositional Alignment in VAR and Diffusion T2I Models

arXiv:2512.11542v2 Announce Type: replace

Achieving compositional alignment between textual descriptions and generated images (covering objects, attributes, and spatial relationships) remains a core challenge for modern text-to-image (T2I) models. Although diffusion-based architectures have been widely studied, the compositional behavior of emerging Visual Autoregressive (VAR) models is still largely unexamined.