Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

arXiv:2604.13054v1 Announce Type: cross Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in