Claw-Eval Benchmark for AI Agents (GitHub Repo)
TLDR AI
Claw-Eval is a human-verified benchmark that evaluates LLM agents on 139 real-world tasks, using Docker sandboxes, multiple services, and structured grading.