The First Karpathy Loop for Production Coding Agents

Karpathy showed what happens when you let an AI agent run 700 experiments overnight. The model proposes hypotheses, runs them, scores results, keeps what works, throws away what doesn't. Repeat. The part nobody talks about: how do you know which experiments actually mattered? I've been building with AI coding agents for months. Claude Code, Codex, Gemini CLI. The pattern is always the same: you give an agent a task, it runs, it produces output. Sometimes the output is good. Sometimes it's not. You squint at logs, compare diffs, make a judgment call. Move on.