Anthropic locked Claude Code to native apps in Jan 2026. Are we still comparing models, or just ecosystems?

Quick context: both vendors publish SWE-bench scores on their own scaffolds. The same model swings 22 points depending on harness design, which is more than the gap between any two frontier models. Since January 2026, Claude plans have been restricted to Anthropic's native apps. Kilocode, Aider, or any other third-party tool can't consume a Max subscription. Codex went the opposite direction and opened up to external integration. Result: every "Claude crushes GPT on code" comparison is Claude on Claude Code vs GPT on a random harness. That's not a model comparison. That's a product comparison.
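
To make the confound concrete, here's a rough sketch of how I'd separate the two effects if you had scores for each (model, harness) pair. Everything in it is an assumption for illustration: the model and harness names are hypothetical, and the numbers are made-up placeholders, not real benchmark results. The idea is just that a model-vs-model gap only counts when both models ran on the same harness, and the harness effect is the spread of one model across harnesses.

```python
from collections import defaultdict

# Hypothetical placeholder scores, NOT real benchmark results.
# Keys are (model, harness); values are pass rates in percent.
scores = {
    ("model_a", "vendor_a_harness"): 70.0,
    ("model_a", "generic_harness"): 52.0,
    ("model_b", "generic_harness"): 55.0,
}


def same_harness_gaps(scores):
    """Return model-vs-model gaps only where both models ran on the same harness."""
    by_harness = defaultdict(dict)
    for (model, harness), score in scores.items():
        by_harness[harness][model] = score

    gaps = {}
    for harness, models in by_harness.items():
        names = sorted(models)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                gaps[(a, b, harness)] = models[a] - models[b]
    return gaps


def harness_swing(scores, model):
    """Spread of a single model's score across every harness it ran on."""
    values = [s for (m, _), s in scores.items() if m == model]
    return max(values) - min(values)


print(same_harness_gaps(scores))
print(harness_swing(scores, "model_a"))
```

With these placeholder numbers, the cross-harness headline would be model_a at 70 vs model_b at 55, but the only like-for-like pair is on the generic harness, where the gap is small and points the other way. If one model can only run on its vendor's harness, the like-for-like table is simply empty for it.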