FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

arXiv:2603.17326v1. While Multimodal Large Language Models (MLLMs) have advanced rapidly, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks because visual details are lost through low-resolution pre