Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8.
r/singularity
Opus 4.7 (no reasoning) scores 52.6 compared to 68.8 for Opus 4.6. Opus 4.7 xhigh is not an improvement.

This benchmark tests whether large language models can infer a specific latent theme from a few examples, use anti-examples to reject the broader but wrong pattern, and then identify the one true match among close distractors.

One example of how Opus 4.7 fails: Theme: religious texts written on animal skin. 4.6 gets the conjunction right. 4.7 loses the material constraint and behaves as if "religious manuscript" alone is enough.
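To make that failure mode concrete, here is a toy sketch of why dropping one half of a conjunction hurts on this kind of item. Everything in it is an assumption for illustration: the `Candidate` fields, the candidate list, and the two rule functions are made up and are not taken from the benchmark's data or scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    religious: bool    # hypothetical feature: is it a religious text?
    animal_skin: bool  # hypothetical feature: is it written on parchment/vellum?


# Invented candidates (NOT benchmark data): one true match plus close distractors.
CANDIDATES = [
    Candidate("Torah scroll on parchment", True, True),   # satisfies both constraints
    Candidate("Sutra printed on paper", True, False),     # religious, wrong material
    Candidate("Land deed on vellum", False, True),        # right material, not religious
    Candidate("Hymn carved in stone", True, False),       # religious, wrong material
]


def conjunction_rule(c: Candidate) -> bool:
    """Narrow theme: religious text AND written on animal skin."""
    return c.religious and c.animal_skin


def broad_rule(c: Candidate) -> bool:
    """Broader but wrong pattern: any religious text."""
    return c.religious


def matches(rule) -> list[str]:
    """Return the names of all candidates the rule accepts."""
    return [c.name for c in CANDIDATES if rule(c)]


if __name__ == "__main__":
    print("conjunction:", matches(conjunction_rule))
    print("broad:", matches(broad_rule))
```

With the conjunction, only one candidate survives. With the broad rule, three candidates look equally valid, so picking the single true match becomes a guess among them. That is roughly what "loses the material constraint" means in the example above.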