Grok 4.20 Multi-Agent Beta burned 333k input tokens on a joke prompt - is this actually how multi-agent AI works right now?

I decided to test Grok 4.20 Multi-Agent Beta through OpenRouter on my personal LLM platform, mostly because there don’t seem to be any real benchmarks available for it yet. Grok 4.20 single-agent looked pretty decent in the AA-Index, so I was curious how the multi-agent version would behave in practice. Since there isn’t much technical detail available, I basically went in blind to get a feel for it and ran a pretty straightforward test.