Blink: Dynamic Visual Token Resolution for Enhanced Multimodal Understanding

arXiv:2512.10548v2

Multimodal large language models (MLLMs) have achieved remarkable progress on various vision-language tasks, yet their visual perception remains limited. Humans, by contrast, perceive complex scenes efficiently by dynamically scanning and focusing on salient regions in a sequential, "blink-like" process. Motivated by this strategy, we first investigate whether MLLMs exhibit similar behavior.