Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

arXiv:2604.20763v1 Announce Type: cross Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which [...] Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.