ARC-AGI 3 paper alleges that Gemini 3 (and other frontier models), intentionally or not, “cheated” their way to their ARC-AGI 1 and 2 scores by memorising similar benchmark tasks during training