Grounding Everything in Tokens for Multimodal Large Language Models

arXiv:2512.10554v2 Announce Type: replace Multimodal large language models (MLLMs) have made significant advances in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization of input images, which limits their ability to accurately ground objects within the 2D image space.