CoVFT: Context-aware Visual Fine-tuning for Multimodal Large Language Models

arXiv:2603.21077v1 Announce Type: new Multimodal large language models (MLLMs) have achieved remarkable progress in cross-modal perception and reasoning, yet a fundamental question remains unresolved: should the vision encoder be fine-tuned or frozen? Despite the success of models such as LLaVA and Qwen-VL, inconsistent design choices and heterogeneous