Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

arXiv:2506.03610v3 Announce Type: replace Large Language Model (LLM) agents are reshaping the game industry by enabling intelligent and human-preferable characters. Yet current game benchmarks fall short of practical needs: they lack evaluations of diverse LLM capabilities across various game genres, studies of agentic modules crucial for complex gameplay, and fine-tuning datasets to adapt pre-trained LLMs into gaming agents. To fill these gaps, we present Orak, a benchmark for