AI RESEARCH
Decomposing and Reducing Hidden Measurement Error in LLM Evaluation Pipelines
arXiv cs.CL
arXiv:2604.11581v1 Announce Type: new
LLM evaluations drive which models get deployed, which safety standards get adopted, and which research