Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies

arXiv:2604.18234v1 Announce Type: cross Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited: most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined.