HypoSpace: Evaluating LLM Creativity as Set-Valued Hypothesis Generators under Underdetermination

arXiv:2510.15614v2 Announce Type: replace As language models are increasingly used in scientific workflows, evaluating their ability to propose sets of explanations, not just a single correct answer, becomes critical. Many scientific problems are underdetermined: multiple, mechanistically distinct hypotheses are consistent with the same observations. We