Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

arXiv:2505.17726v3 Announce Type: replace-cross Recently, multimodal large language models (MLLMs) have emerged as a key approach to achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively as both input and output.