ConsistCompose: Unified Multimodal Layout Control for Image Composition

arXiv:2511.18333v3 Announce Type: replace Unified multimodal models that couple visual understanding with image generation have advanced rapidly, yet most systems still focus on visual grounding (aligning language with image regions), while the generative counterpart, linguistic-embedded layout-grounded generation (LELG) for layout-controllable multi-instance generation, remains underexplored, which limits precise compositional control.