When Stability Fails: Hidden Failure Modes of LLMs in Data-Constrained Scientific Decision-Making

arXiv:2603.15840v1 Announce Type: cross Large language models (LLMs) are increasingly used as decision tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility across repeated runs. While these properties are desirable, stability alone does not guarantee agreement with statistical ground truth when such references are available. We