[Project] JudgeGPT — open-source LLM-as-judge benchmarking tool with configurable scoring rubrics, CoT reasoning, and real-time GPU telemetry
r/MachineLearning
Sharing a tool I built that lets you run your own LLM-as-judge evaluations locally, against any models you have running via Ollama.

The core problem I tried to address: LLM judges are notoriously unreliable out of the box. They show position bias, verbosity bias, self-family bias (~5-7% score inflation when the judge shares a model family with the evaluated model), and leniency clustering in smaller models. Most local benchmarking tools just wrap a judge prompt around a response and call the output a score. I wanted something more principled.

What JudgeGPT does differently:

1.