Data Organization Matters in Multimodal Instruction Tuning: A Controlled Study of Capability Trade-offs

arXiv:2603.27744v1 Announce Type: new Recent multimodal large language models (MLLMs) perform strongly on general visual understanding, diagram and chart reasoning, and document-centric perception. However, these abilities are learned from heterogeneous supervision sources with very different task structures and learning demands, and the effect of their temporal organization during