Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

arXiv:2510.15253v3 Announce Type: replace Document understanding is critical for applications ranging from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but the multimodal nature of documents, which combine text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.