G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

arXiv:2605.12309v1 Announce Type: new The development of separate-encoder unified multimodal models (UMMs) comes with a rapidly growing inference cost due to dense visual token processing. In this paper, we focus on understanding-side visual token reduction to improve the efficiency of separate-encoder UMMs. While this topic has been widely studied for MLLMs, existing methods typically rely on signals such as attention scores and text-image similarity, implicitly assuming that the final objective is discriminative reasoning.