AI RESEARCH

Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

arXiv cs.LG

arXiv:2604.17716v1 Announce Type: cross The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show a mean Type 2 AUROC of .624 (SD = .048); Invalid models show a mean Type 2 AUROC of .357 (SD = .231). The difference is large (Cohen's d = 2.81, p = .002). The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624).
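
To make the two quantities in the abstract concrete, the following is a minimal sketch of how a Type 2 AUROC and a selective-prediction accuracy could be computed for one model. It is not the authors' procedure, which the abstract does not specify. The function names `type2_auroc` and `selective_accuracy`, the tie-handling rule, the coverage rounding, and the toy data are all assumptions made for illustration. Here Type 2 AUROC is taken in its usual sense: the probability that a randomly chosen correct answer carries higher confidence than a randomly chosen incorrect one, so .5 means the confidence signal is uninformative and values below .5 mean it is inverted.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item outranks an incorrect item by confidence.

    Ties count as half a win (an assumption; other conventions exist).
    """
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def selective_accuracy(
    confidence: Sequence[float], correct: Sequence[bool], coverage: float
) -> float:
    """Accuracy on the most confident fraction `coverage` of items.

    The model "answers" only its top-k items by confidence and abstains on
    the rest; k is truncated from coverage * n and is at least 1 (assumption).
    """
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(zip(confidence, correct), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(coverage * len(ranked)))
    kept = ranked[:k]
    return sum(1 for _, ok in kept if ok) / k


# Toy data (hypothetical, not from the paper): five items from one model.
toy_confidence = [0.9, 0.8, 0.6, 0.4, 0.3]
toy_correct = [True, True, False, True, False]

auroc = type2_auroc(toy_confidence, toy_correct)
inverted = type2_auroc([-c for c in toy_confidence], toy_correct)
acc_top = selective_accuracy(toy_confidence, toy_correct, coverage=0.4)
acc_all = selective_accuracy(toy_confidence, toy_correct, coverage=1.0)
```

On this toy data the informative confidence signal lets the model reach higher accuracy by abstaining on low-confidence items, which is the link between Type 2 AUROC and selective prediction that the abstract tests. Negating the confidences flips the AUROC around .5, the pattern that a mean AUROC of .357 suggests for the Invalid tier.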