Claw-Eval Benchmark for AI Agents (GitHub Repo)

TLDR AI

Claw-Eval provides a human-verified benchmark for evaluating LLM agents across 139 real-world tasks using Docker sandboxes, multiple services, and structured grading.