Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

arXiv:2603.16654v1 Announce Type: cross Reasoning-focused large language models (LLMs) have advanced on many NLP tasks, yet evaluating them remains challenging. Final answers alone do not expose the intermediate reasoning steps, so it is difficult to determine whether a model truly reasons correctly and where its failures occur. Existing multi-hop QA benchmarks, in turn, lack the step-level annotations needed to diagnose reasoning failures.