DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

arXiv:2604.12812v1. Existing Multimodal Large Language Models (MLLMs) degrade significantly on long document understanding as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried among irrelevant pages; and 2) supervision scarcity, as datasets that offer only short final answers provide a weak learning signal.