Before You Tune Your Judge, Tune Your Rubric
Towards AI
Why most LLM-as-judge reliability work starts in the wrong place.

You wired up an LLM judge three months ago. Scores were noisy. You raised K from 1 to 3, then to 5. You lowered the temperature to 0. You grew the test set from 50 examples to 200. You tried a stronger model. You added chain-of-thought. Your scores are still noisy on the dimensions that matter most.

This is the common trajectory, and most of those moves are aimed in the wrong direction. The dominant source of unreliable LLM judge scores is not the model and not the sampling strategy. It is the rubric itself.