Jake Benchmark v1: I spent a week watching 7 local LLMs try to be AI agents with OpenClaw. Most couldn't even find the email tool.

I tested 7 local models on 22 real agent tasks using OpenClaw on a Raspberry Pi 5 with an RTX 3090 running Ollama. Tasks included reading emails, scheduling meetings, creating tasks, detecting phishing, handling errors, and browser automation. The winner by a massive margin: qwen3.5:27b-q4_K_M at 59.4%. The runner up (qwen3.5:35b) scored only 23.2%. Everything else was below 5%. Biggest surprises: The quantized 27B model beat the larger 35B version by 2.5x. A 30B model scored dead last at 1.6%. Medium thinking worked best. Too much thinking actually hurt performance.