Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization

arXiv:2603.09538v1 Announce Type: new Unified vision-language models have made significant progress in multimodal understanding and generation, yet they largely fall short in producing multimodal interleaved outputs, a capability crucial for tasks such as visual storytelling and step-by-step visual reasoning. In this work, we propose a reinforcement learning-based post-