Generative Active Testing: Efficient LLM Evaluation via Proxy Task Adaptation

arXiv:2603.19264v1 Announce Type: cross With the widespread adoption of pre-trained Large Language Models (LLMs), there is high demand for task-specific test sets to benchmark their performance in domains such as healthcare and biomedicine. However, the cost of labeling test samples while developing new benchmarks poses a significant challenge, especially when expert annotators are required. Existing frameworks for active sample selection offer limited support for generative Question Answering tasks, where option dynamics can affect model decision boundaries.