InterCoG: Towards Spatially Precise Image Editing with Interleaved Chain-of-Grounding Reasoning

arXiv:2603.01586v3 Announce Type: replace Emerging unified editing models have demonstrated strong capabilities in general object editing tasks. However, it remains a significant challenge to perform fine-grained editing in complex multi-entity scenes, particularly those where targets are not visually salient and require spatial reasoning. To this end, we propose InterCoG, a novel text-vision Interleaved Chain-of-Grounding reasoning framework for fine-grained image editing in complex real-world scenes.