CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

arXiv:2601.13622v2 Announce Type: replace Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with the linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification.