STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

arXiv:2604.24544v1 Announce Type: new The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, a single-domain focus, and a lack of multilingual support.