Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

arXiv:2508.02900v2

There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundation models and agents. However, existing planning benchmarks remain woefully inadequate for measuring the planning capabilities of these systems. Most either focus on loosely defined tasks such as travel planning or reuse existing domains and problems from international planning competitions.