Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

arXiv:2605.01630v1

Rankings produced by holistic LLM-as-a-judge scoring are sensitive to the bias of the chosen judge model. We show that switching to binary rubric scoring with multi-judge filtering removes this sensitivity: decomposing the judgement matters more than the judge model itself. To this claim, we