Qwen 3.5 4B is the first small open-source model to solve this.

I ran a very small abstraction test:

11118888888855 -> 118885
79999775555 -> 99755
AAABBBYUDD -> ?

Qwen 3.5 4B was the first small open-source model to solve it. That immediately caught my attention, because a lot of much bigger models failed.

Models that failed this test in my runs: GPT-4, GPT-4o, GPT-4.1, o1-mini, o3-mini, o4-mini, OSS 20B, OSS 120B, Gemini 2.5 Flash, and all Qwen 2.5 sizes. Within Qwen 3.0, only Qwen3-235B-A22B-2507 passed.
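
One rule that reproduces both worked examples is to replace each run of n identical characters with floor(log2(n)) copies of that character, so runs of length 1 disappear. That reading is an assumption: the rule is not spelled out above, and other rules may fit the two examples as well. A minimal Python sketch of that assumed rule (the function name `compress_runs` is made up for illustration):

```python
from itertools import groupby


def compress_runs(s: str) -> str:
    """Shrink each run of n identical characters to floor(log2(n)) copies.

    Assumed rule, inferred from the two worked examples only.
    """
    out = []
    for ch, group in groupby(s):
        n = sum(1 for _ in group)  # length of the current run, always >= 1
        # For n >= 1, n.bit_length() - 1 equals floor(log2(n)).
        out.append(ch * (n.bit_length() - 1))
    return "".join(out)


print(compress_runs("11118888888855"))  # 118885
print(compress_runs("79999775555"))     # 99755
print(compress_runs("AAABBBYUDD"))      # ABD
```

Under this assumed rule the third input comes out as ABD. The runs of length 3 are what make the puzzle ambiguous, since the result depends on whether log2(3) is rounded down or up, so treat that output as one plausible answer rather than the official one.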