Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

arXiv:2604.22517v1. Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We.