Measuring Iterative Temporal Reasoning with Time Puzzles

arXiv:2601.07148v3 Announce Type: replace-cross Tool use, such as web search, has become a standard capability even in freely available large language models (LLMs). However, existing benchmarks evaluate temporal reasoning mainly in static, non-tool-using settings, which poorly reflect how LLMs perform temporal reasoning in practice. We