AI RESEARCH
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
arXiv cs.LG
arXiv:2605.06652v1 Announce Type: new Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget.
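One way to picture the validity condition in the last sentence is as a record that pins the six named components, with two scores treated as comparable only when their records match. The sketch below is illustrative only and is not taken from the paper: the class names `AuditContract` and `ScoredModel`, the field names, the `comparable` helper, and all example values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AuditContract:
    """Hypothetical record of the six components the abstract says must stay fixed."""
    scenario_pack: str
    rubric: str
    auditor: str
    judge: str
    sampling_config: str
    rerun_budget: int

    def fingerprint(self) -> str:
        # Serialize with sorted keys so the same contract always hashes the same way.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ScoredModel:
    """A score together with the contract it was produced under (assumed layout)."""
    model_name: str
    score: float
    contract: AuditContract


def comparable(a: ScoredModel, b: ScoredModel) -> bool:
    """Two scores are treated as comparable only if every contract component matches."""
    return a.contract.fingerprint() == b.contract.fingerprint()


# Placeholder values, invented for illustration.
base = AuditContract(
    scenario_pack="pack-v1",
    rubric="rubric-v1",
    auditor="auditor-a",
    judge="judge-a",
    sampling_config="temperature=0.7,top_p=0.95",
    rerun_budget=5,
)
# Same contract except for the judge, which is enough to break comparability.
changed_judge = replace(base, judge="judge-b")

same_contract = comparable(ScoredModel("model-1", 0.8, base),
                           ScoredModel("model-2", 0.6, base))
different_judge = comparable(ScoredModel("model-1", 0.8, base),
                             ScoredModel("model-2", 0.6, changed_judge))
print(same_contract, different_judge)
```

Hashing the whole record rather than comparing selected fields reflects the abstract's wording that all six components must be fixed; changing any one of them, here the judge, yields a different fingerprint and the comparison is rejected.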