Personalized LLM Benchmarks: Individual Rankings Diverge from Aggregate (ρ=0.04)

Key Takeaways

A new study of 115 Chatbot Arena users finds that personalized LLM rankings diverge dramatically from aggregate benchmarks, with an average Bradley-Terry correlation of only ρ=0.04 between individual and aggregate rankings. This challenges the validity of one-size-fits-all model evaluations.

A new arXiv preprint provides quantitative evidence that aggregate LLM benchmarks are poor predictors of individual user satisfaction.
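
To make the headline number concrete, here is a minimal sketch of how a per-user versus aggregate comparison could be computed. It is not the preprint's code. It assumes that ρ is a Spearman rank correlation between Bradley-Terry strengths fitted on one user's votes and strengths fitted on all votes, which the summary above does not confirm. The function names, the smoothing `prior`, and the toy votes are all hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, models, iters=200, prior=0.5):
    """Fit Bradley-Terry strengths with the classic MM update.

    battles: list of (winner, loser) pairs.
    prior: assumed smoothing; adds a fractional virtual win in each
           direction for every pair so sparse per-user data stays finite.
    """
    wins = defaultdict(float)
    games = defaultdict(float)
    for winner, loser in battles:
        wins[winner] += 1.0
        games[(winner, loser)] += 1.0
        games[(loser, winner)] += 1.0
    for i in models:
        for j in models:
            if i != j:
                wins[i] += prior
                games[(i, j)] += 2.0 * prior

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games[(i, j)] / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def average_ranks(values):
    """Ranks starting at 1 (ascending); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in order[pos:end + 1]:
            ranks[k] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # convention chosen here for constant rankings
    return cov / math.sqrt(var_x * var_y)


# Toy data (invented): the crowd prefers A > B > C, one user prefers C > B > A.
models = ["A", "B", "C"]
aggregate_votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 8 + [("C", "B")] * 2
    + [("A", "C")] * 9 + [("C", "A")] * 1
)
user_votes = [("C", "B"), ("C", "A"), ("B", "A")]

aggregate = fit_bradley_terry(aggregate_votes, models)
personal = fit_bradley_terry(user_votes, models)
rho = spearman([aggregate[m] for m in models], [personal[m] for m in models])
print(f"per-user vs aggregate rho = {rho:.2f}")
```

Averaging this per-user value across all users would give a single summary figure of the kind reported above. A value near zero would mean that knowing the aggregate ranking says almost nothing about the order in which a given user ranks the models.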