CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

arXiv:2604.22989v1 Announce Type: new Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach