Multimodal Continual Learning with MLLMs from Multi-scenario Perspectives

arXiv:2511.18507v3 Announce Type: replace-cross Multimodal large language models (MLLMs) deployed on devices must adapt to continuously changing visual scenarios, such as variations in background and perspective, to perform complex visual tasks effectively. To investigate catastrophic forgetting under real-world scenario shifts, we construct a multimodal visual understanding dataset (MSVQA) covering four distinct scenarios and perspectives: high-altitude, underwater, low-altitude, and indoor environments.