Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

arXiv:2605.12305v1 Announce Type: new While recent advancements in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets.