Claw-Eval Benchmark for AI Agents (GitHub Repo)

Claw-Eval is a human-verified benchmark for evaluating LLM agents on 139 real-world tasks. Tasks run in Docker sandboxes against multiple services and are scored with structured grading.