A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

arXiv:2507.09861v2 Announce Type: replace-cross Visually Rich Document Understanding (VRDU) has become a pivotal area of research, driven by the need to automatically interpret documents that contain intricate visual, textual, and structural elements. Recently, Multimodal Large Language Models (MLLMs) have demonstrated significant promise in this domain, through both OCR-based and OCR-free approaches to extracting information from document images.