Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

arXiv:2604.25636v1 Announce Type: new Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability lets a UMM refine its own output after the initial generation, potentially raising the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, in which the UMM produces editing instructions that modify misaligned regions while preserving aligned content.