The cold-grill diagnostic that made me rewrite my Python learning protocol

I run an AI-engineering research lab that studies what it actually takes to work with Claude Code on hard technical surfaces, not from Claude Code. Two surfaces run in parallel: a learning protocol where Claude Opus is the coaching partner, and a QA-automation pipeline where Claude Code + MCP ship sprint reporting, Jira pulls, and Slack digests on a real work loop. Both surfaces stress-test the same operator pattern: spec-first, sub-agent orchestration, eval on agent output, foundational-fluency check.