E2E-GMNER: End-to-End Generative Grounded Multimodal Named Entity Recognition

arXiv:2604.17319v1 Announce Type: cross

Grounded Multimodal Named Entity Recognition (GMNER) aims to jointly identify named entity mentions in text, predict their semantic types, and ground each entity to its corresponding visual region in the associated image. Existing approaches predominantly adopt pipeline-based architectures that decouple textual entity recognition from visual grounding, which leads to error accumulation across stages and hinders joint optimization.