ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

arXiv:2510.18941v2 Announce Type: replace-cross Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs on processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We