How is Gemini 3.1 at the top of SWE-bench?

Genuinely confused. In my personal experience, it's nowhere near as reliable or capable as Claude Opus 4.6 or GPT 5.4 for real-world coding tasks. Those models feel way more consistent, especially with complex debugging and reasoning. Are these benchmarks not reflecting actual developer workflows, or am I missing something here?

submitted by /u/Additional-Alps-8209