GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

arXiv:2604.02648v1 Announce Type: cross The autonomous discovery of bugs remains a significant challenge in modern software development. The complexity of dynamic runtime environments makes bug discovery considerably harder than code generation for large language models (LLMs). In this paper, we take game development as a representative domain and