STARFlow2: Bridging Language Models and Normalizing Flows for Unified Multimodal Generation

arXiv:2605.08029v1 Announce Type: cross Deep generative models have advanced rapidly across text and vision, motivating unified multimodal systems that can understand, reason over, and generate interleaved text-image sequences. Most existing approaches combine autoregressive language modeling with diffusion-based image generators, inheriting a structural mismatch between causal text generation and iterative visual denoising.