DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

arXiv:2605.08888v1 Announce Type: cross Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We