I created an LLM benchmark and I still can't believe how well Qwen3.5-122b performed

I've been working on this game for 2 months, literally spending all my time on it (the last time I left the apartment was March 1st). It's a text-based strategy game with a massive amount of incoming damage on both LLM sides. Each LLM controls 4 small "countries", one of which is the Sovereign (the most important). There is a memory system: after examining the damage done to them and the damage they inflicted on the enemy, the LLMs write a new prompt for themselves. This truly measures whether they can self-criticize and quickly adapt. The reflection happens over 20 times for each LLM per game.
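
To make the reflection step concrete, here is a minimal Python sketch of what such a loop could look like. It is not the actual game code: `Player`, `TurnReport`, `reflect`, and `stub_llm` are hypothetical names, and the stub stands in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TurnReport:
    """Damage summary handed to the model before it reflects (hypothetical shape)."""
    damage_taken: int
    damage_dealt: int


@dataclass
class Player:
    """One LLM side: its current self-written prompt plus the earlier ones."""
    name: str
    strategy_prompt: str
    history: list[str] = field(default_factory=list)


def build_reflection_request(player: Player, report: TurnReport) -> str:
    """Compose the text that asks the model to critique itself and rewrite its prompt."""
    if report.damage_taken > report.damage_dealt:
        verdict = "You took more damage than you dealt."
    else:
        verdict = "You dealt at least as much damage as you received."
    return (
        f"Current strategy: {player.strategy_prompt}\n"
        f"Damage taken: {report.damage_taken}, damage dealt: {report.damage_dealt}.\n"
        f"{verdict}\n"
        "Write a new strategy prompt for yourself."
    )


def reflect(player: Player, report: TurnReport, llm: Callable[[str], str]) -> str:
    """Run one reflection: ask the model for a new prompt and store the old one."""
    new_prompt = llm(build_reflection_request(player, report)).strip()
    if new_prompt:  # keep the old prompt if the model returns nothing usable
        player.history.append(player.strategy_prompt)
        player.strategy_prompt = new_prompt
    return player.strategy_prompt


def stub_llm(request: str) -> str:
    """Stand-in for a real model call (assumption: a real run would hit an LLM API here)."""
    if "took more damage" in request:
        return "Defend the Sovereign first."
    return "Keep pressing the attack."


if __name__ == "__main__":
    side = Player(name="A", strategy_prompt="Attack everything.")
    print(reflect(side, TurnReport(damage_taken=30, damage_dealt=10), stub_llm))
```

Calling `reflect` once per reflection point, 20+ times per side, would leave a trail of prompts in `history` that shows how each model changed its plan over a game.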