Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

arXiv:2605.19523v1 Announce Type: cross Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding, yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional.