Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv:2605.19196v1 Announce Type: new Deep research agents increasingly automate complex information-seeking tasks, producing evidence-grounded reports via multi-step reasoning, tool use, and synthesis. Their growing role demands scalable, reliable evaluation, positioning LLM-as-judge as a supervision paradigm for assessing factual accuracy, evidence use, and reasoning quality.