AI RESEARCH

AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

arXiv CS.AI

arXiv:2603.21362v1 Announce Type: new LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for the task at hand: code debugging demands Correctness and Error Handling, while web navigation demands Goal Alignment and Action Efficiency.
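The mismatch between a fixed rubric and task-specific needs can be illustrated with a minimal sketch. The four dimension names come from the abstract; the task-type keys, the lookup table, and both function names are hypothetical and chosen only for illustration. This is not AdaRubric's actual method, which the abstract excerpt does not describe.

```python
# Hypothetical illustration: dimension names are from the abstract,
# but the mapping and helper functions are assumptions, not the paper's method.
TASK_RUBRICS = {
    "code_debugging": ["Correctness", "Error Handling"],
    "web_navigation": ["Goal Alignment", "Action Efficiency"],
}


def rubric_for(task_type: str) -> list[str]:
    """Return the evaluation dimensions that matter for a task type."""
    return TASK_RUBRICS[task_type]


def missed_dimensions(fixed_rubric: list[str], task_type: str) -> list[str]:
    """List the dimensions a task needs that a fixed rubric never scores."""
    return [d for d in rubric_for(task_type) if d not in fixed_rubric]


# A "fixed" rubric: the code-debugging dimensions reused for every task.
fixed = rubric_for("code_debugging")

# Applied to web navigation, the fixed rubric scores none of the needed dimensions.
print(missed_dimensions(fixed, "web_navigation"))
# → ['Goal Alignment', 'Action Efficiency']
```

A static lookup table is the simplest possible stand-in for adaptivity; it only shows why one rubric reused across task types leaves relevant dimensions unscored.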