AI RESEARCH

Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

arXiv CS.AI

arXiv:2510.07632v2 Announce Type: replace

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we