BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

arXiv:2603.07931v1 Announce Type: new Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We