AI RESEARCH
Measuring Iterative Temporal Reasoning with Time Puzzles
arXiv CS.AI
arXiv:2601.07148v3 Announce Type: replace-cross
Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We