AI RESEARCH
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation
arXiv CS.CV
arXiv:2605.12305v1 (announce type: new)
While recent advances in multimodal language models have enabled image generation from expressive multi-image instructions, existing methods struggle to maintain performance under complex interleaved instructions. This limitation stems from the structural separation of images and text in current paradigms, which forces models to bridge difficult long-range dependencies to match descriptions with visual targets.