Auto LLM ranking tool that uses a Judge LLM for any given task

r/LocalLLaMA

Most people pick LLMs like this: try a few, go with the one that "felt better." That's not evaluation, that's a guess. To fix this, I built a small LLM auto-evaluation tool that replaces gut feel with task-specific benchmarking. You describe a task in natural language; the tool then uses a Judge LLM to generate task-specific test cases, runs inference across the candidate models in parallel, and scores each output on accuracy, hallucination, grounding, tool-calling, and clarity. It outputs a ranked list of the models along with a system prompt optimized for the task.
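
For a rough idea of how the generate / run / score / rank loop could fit together, here is a minimal Python sketch. It is not the tool's actual code. Every name in it (`generate_cases`, `run_models`, `rank_models`, `toy_judge`, the two stub models) is hypothetical, the models and judge are plain functions standing in for real LLM calls, and it assumes each criterion is scored from 0 to 1 with higher meaning better (so a high `hallucination` score means fewer hallucinations) and that all five criteria are weighted equally. System prompt optimization is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, List, Tuple

# The five scoring dimensions named in the post.
CRITERIA = ["accuracy", "hallucination", "grounding", "tool_calling", "clarity"]

Model = Callable[[str], str]                    # prompt -> model output
Judge = Callable[[str, str], Dict[str, float]]  # (prompt, output) -> score per criterion


def generate_cases(task: str) -> List[str]:
    """Stand-in for the Judge LLM writing test cases for a task (fixed list here)."""
    return [f"{task}: case {i}" for i in range(3)]


def run_models(models: Dict[str, Model], prompts: List[str]) -> Dict[str, List[str]]:
    """Submit every (model, prompt) pair to a thread pool and collect the outputs."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {
            name: [pool.submit(fn, p) for p in prompts] for name, fn in models.items()
        }
        return {name: [f.result() for f in fs] for name, fs in futures.items()}


def rank_models(
    models: Dict[str, Model], prompts: List[str], judge: Judge
) -> List[Tuple[str, float]]:
    """Average the judge's per-criterion scores per model, then sort best first."""
    outputs = run_models(models, prompts)
    totals = {}
    for name, outs in outputs.items():
        per_case = []
        for prompt, out in zip(prompts, outs):
            scores = judge(prompt, out)
            # Assumption: equal weight for all five criteria.
            per_case.append(mean(scores[c] for c in CRITERIA))
        totals[name] = mean(per_case)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# --- Toy stand-ins so the sketch runs without any API keys ---
def echo_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def blank_model(prompt: str) -> str:
    return ""


def toy_judge(prompt: str, output: str) -> Dict[str, float]:
    """Fake judge: full marks if the output mentions the prompt, zero otherwise."""
    score = 1.0 if prompt in output else 0.0
    return {c: score for c in CRITERIA}


prompts = generate_cases("summarize a support ticket")
ranking = rank_models({"echo_model": echo_model, "blank_model": blank_model}, prompts, toy_judge)
print(ranking)
```

In a real setup the stub functions would wrap API or local inference calls, and the judge would return structured scores parsed from the Judge LLM's response. Threads are a reasonable fit here on the assumption that inference calls are I/O-bound.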