Claude Opus 4.7 (high) unexpectedly performs significantly worse than Opus 4.6 (high) on the Thematic Generalization Benchmark: 80.6 → 72.8.

Opus 4.7 (no reasoning) scores 52.6, compared to 68.8 for Opus 4.6 (no reasoning). Opus 4.7 (xhigh) is not an improvement.

This benchmark tests whether large language models can infer a specific latent theme from a few examples, use anti-examples to reject a broader but wrong pattern, and then identify the one true match among close distractors.

One example of how Opus 4.7 fails: the theme is "religious texts written on animal skin." 4.6 gets the conjunction right. 4.7 loses the material constraint and behaves as if "religious manuscript" alone is enough.
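
To make that failure mode concrete, here is a minimal sketch of the theme as a conjunction of two constraints. The candidate items and the names `Candidate`, `full_theme`, and `relaxed_theme` are invented for illustration; they are not benchmark data and this is not the benchmark's scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    religious_text: bool
    on_animal_skin: bool


# Hypothetical candidates, made up for this sketch (not actual benchmark items).
CANDIDATES = [
    Candidate("psalter copied on vellum", True, True),
    Candidate("prayer book printed on paper", True, False),
    Candidate("land deed on parchment", False, True),
]


def full_theme(c: Candidate) -> bool:
    # The intended theme: both constraints must hold at once.
    return c.religious_text and c.on_animal_skin


def relaxed_theme(c: Candidate) -> bool:
    # The failure mode described above: the material constraint is dropped.
    return c.religious_text


def matches(predicate) -> list[str]:
    # Names of all candidates that satisfy the given predicate.
    return [c.name for c in CANDIDATES if predicate(c)]


if __name__ == "__main__":
    print("full theme:   ", matches(full_theme))
    print("relaxed theme:", matches(relaxed_theme))
```

Under the full conjunction only one candidate qualifies. Once the material constraint is dropped, the paper prayer book qualifies too, so the distractor can no longer be separated from the true match.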