MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

arXiv:2511.22989v2 Announce Type: replace Recent text-to-image generation models have acquired the ability to perform multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets mostly focus on generation from a single reference image or only a few, which prevents us from measuring progress in model performance or identifying weaknesses when models must follow instructions involving a larger number of references.