Million-Token Context Window

Model-by-model takeaway

Claude Opus 4.7 had the strongest baseline reasoning, but it suffered the sharpest erosion in evidence quality under heavy context, and it got worse even when given a thinking budget.

Claude Sonnet 4.6 was the surprise winner on heavy-fill pairwise tests, but it paid for that with very high reasoning-token usage and long latency.

GPT-5.5 was the safest against hallucinations and cross-contamination, but it lost reasoning depth as the context filled and showed a cliff-like drop at very high fill.