FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

arXiv:2604.06757v1 Announce Type: new Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation.