Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs

arXiv:2509.17314v3

Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truth, forcing reliance on human judgment, while existing test adequacy measures typically rely on output uncertainty and are thus available only after full inference.