Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

arXiv:2510.07632v2
Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we