RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context

arXiv:2506.05205v2 Announce Type: replace Large language models (LLMs) are increasingly used to solve complex tasks where they must retrieve and compose many pieces of in-context information in long reasoning chains. For many real-world tasks, it is hard to accurately gauge how model performance and strategy change as task complexity grows. To evaluate models' complex reasoning capability in a scalable and verifiable way, we