A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

arXiv:2605.08432v1 Announce Type: cross Calibration measures whether a model's predicted confidence aligns with its empirical accuracy, and is central to the reliable deployment of large language models (LLMs) in high-stakes domains such as medicine and law. While much recent work focuses on improving LLM calibration, the equally important question of how to evaluate it in realistic settings remains underdeveloped.