[R] Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation

Hey all, quick share: we just dropped a paper where we stop grading models on the final answer alone and start checking whether they actually reason through the problem.

TL;DR: We built CRYSTAL, a benchmark of 6,372 visual questions with verified step-by-step reasoning. The takeaway? Most models are really good at stating the right answer while skipping most of the actual thinking.

The fun stuff:

GPT5 gets 58% accuracy but only recovers 48% of the reasoning steps. It's basically vibing its way to the right answer.

Gemma3 4B out-reasons InternVL3.5 38B despite being 9.5x smaller. Size isn't everything.
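
For anyone who wants the intuition in code, here's a toy sketch of how a model can get the answer right while recovering only some of the reference reasoning steps. This is not the CRYSTAL scoring code: the function name `step_recall`, the exact-match comparison, and the example steps are all assumptions made up for illustration, and a real benchmark would need a fuzzier way of matching steps.

```python
def step_recall(predicted_steps, reference_steps):
    """Fraction of reference steps that also appear in the model's reasoning.

    Assumption: steps match only if they are identical after trimming
    whitespace and lowercasing. This is a stand-in, not the paper's method.
    """
    if not reference_steps:
        return 0.0
    predicted = {step.strip().lower() for step in predicted_steps}
    hits = sum(1 for step in reference_steps if step.strip().lower() in predicted)
    return hits / len(reference_steps)


# Hypothetical question: the model answers correctly but skips the first step.
reference = [
    "identify the two bars",
    "read both heights",
    "subtract the heights",
]
predicted = [
    "read both heights",
    "subtract the heights",
]
answer_correct = True

recall = step_recall(predicted, reference)
print(f"answer correct: {answer_correct}, step recall: {recall:.2f}")
```

Scoring only `answer_correct` would give this response full marks, while the step-level number shows that part of the reasoning is missing.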