GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark

GPT-5.5 (scores shown as GPT-5.4→GPT-5.5):
xhigh: 94.0→97.5
high: 93.6→96.9
medium: 92.0→95.0
no reasoning: 32.8→37.5

Kimi K2.6 improves over Kimi K2.5 (78.3→91.4) and becomes the top open weights model.
DeepSeek V4 Pro improves over DeepSeek V3.2 (50.2→75.7).
DeepSeek V4 Flash scores 53.2.
Qwen 3.6 Max Preview scores 82.2 (Qwen 3.6 Plus scored 71.3).
Tencent Hy3 Preview scores 30.2.
Ling 2.6 1T (no reasoning) scores 10.8.

Previously:
Opus 4.7 (high) scores 41.0 on the Extended NYT Connections Benchmark.
Opus 4.7 (no reasoning) scores 15.3.
Opus 4.7 (high) refuses to answer 54% of the puzzles.