LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation

arXiv:2604.11789v1 Announce Type: new Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision-language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision.