WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

arXiv:2601.02430v2 Announce Type: replace-cross Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging due to the need for real-world user requirements, generalizable evaluation metrics without relying on ground-truth implementations or test cases, and interpretable evaluation results. To address these challenges, we