AI RESEARCH
Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
arXiv CS.AI
arXiv:2605.13801v1 Announce Type: cross
As generative AI models such as large language models (LLMs) become pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they