MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

arXiv:2604.18418v1 Announce Type: new Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we