How I built AgentForge, an open-source agent harness that benchmarks 4 different memory architectures on real coding tasks.

Everyone in AI right now is arguing about which model is best. GPT vs Claude vs Gemini. Benchmark scores. Arena ratings. Token prices. I think they're asking the wrong question. LangChain proved it earlier this year: their coding agent jumped from outside the top 30 to the top 5 on Terminal Bench 2.0 by changing nothing about the model. They only changed the harness - the infrastructure that wraps the agent. Anthropic's own engineering team discovered that their agents exhibited "context anxiety" - performance degraded as context filled up, even after compaction.