Differences Between Opus 4.6 and Opus 4.7 on MineBench

Some notes: For what's supposedly the SOTA model that beats all other models on essentially every benchmark, I expected it to be a lot more consistent, honestly. You'll notice how it sometimes focused too much on the scenery (like the arcade or cottage builds), but the prompt has remained the same, and Gemini 3.1 and GPT 5.4 were benchmarked with the same prompt. The prompt encourages the model to decide on its own when to focus on scenery, which might indicate that Opus 4.7 isn't as good at creative / brainstorming tasks as Opus 4.6 was? It might also be the adaptive thinking mode causing this.