Show HN: Auto LLM Ranker – Describe a task in English and get ranked models

I got tired of picking LLMs based on vibes and on leaderboards that don't reflect real workloads, so I built this. You describe a task in plain English. The tool generates a test suite for that specific task, discovers candidate models via OpenRouter, benchmarks them in parallel, and uses a judge LLM to score every response on five dimensions: accuracy, hallucination, grounding, tool-calling, and clarity. The output is a ranked top 3 with each model's average latency, plus a task-specific system prompt optimized for the winner.
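For anyone curious what the benchmark-and-rank step could look like, here is a minimal sketch, not the tool's actual code. The names `rank_models`, `run_model`, and `judge` are hypothetical; the two callables stand in for the model call and the judge LLM call. It also assumes every dimension is scored so that higher is better, that the five dimensions are weighted equally, and that ties are broken by lower latency. None of that is confirmed by the description above.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# The five judged dimensions from the post (identifier spelling is an assumption).
DIMENSIONS = ("accuracy", "hallucination", "grounding", "tool_calling", "clarity")


def rank_models(models, test_cases, run_model, judge, top_n=3):
    """Benchmark models in parallel and return the top_n by mean judge score.

    run_model(model, case) -> (response, latency_seconds)   # hypothetical hook
    judge(case, response)  -> {dimension: score}            # hypothetical hook
    """

    def evaluate(model):
        scores, latencies = [], []
        for case in test_cases:
            response, latency = run_model(model, case)
            verdict = judge(case, response)
            # Assumption: equal weight per dimension, higher is better.
            scores.append(mean(verdict[d] for d in DIMENSIONS))
            latencies.append(latency)
        return {
            "model": model,
            "score": mean(scores),
            "avg_latency": mean(latencies),
        }

    # One worker per model; the calls are I/O-bound, so threads are enough.
    with ThreadPoolExecutor(max_workers=max(1, len(models))) as pool:
        results = list(pool.map(evaluate, models))

    # Best score first; lower average latency breaks ties (assumption).
    results.sort(key=lambda r: (-r["score"], r["avg_latency"]))
    return results[:top_n]
```

In practice `run_model` would wrap a chat completion request and time it, and `judge` would parse the judge model's structured verdict. Passing both in as plain callables keeps the ranking logic testable with stubs and no network access.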