BAGEL: Benchmarking Animal Knowledge Expertise in Language Models

arXiv:2604.16241v1 Announce Type: cross Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We