Brittlebench: Quantifying LLM robustness via prompt sensitivity

arXiv:2603.13285v1 Announce Type: cross Existing evaluation methods largely rely on clean, static benchmarks, which can overestimate true model performance by failing to capture the noise and variability inherent in real-world user inputs. This is especially true for language models, which can face human-generated text queries containing mistakes, typos, or alternative ways of phrasing the same question. In this work, we