[P] I built an open-source benchmark to test whether LLMs are actually as accurate as their stated confidence suggests (Spoiler: They often aren't)

Hey everyone, when you build systems around modern open-source LLMs, one of the biggest issues is that they can hallucinate or give an incorrect answer while reporting 95%+ confidence. That makes them hard to deploy reliably in the real world unless we understand their "overconfidence gaps." To dig into this, I built the LLM Confidence Calibration Benchmark. My goal was to analyze whether a model's stated confidence actually matches how often it is correct, across different modes of thought.
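For anyone new to calibration, here is a rough sketch of what "stated confidence vs. true correctness" can mean numerically. This is not the benchmark's actual code. The function names `overconfidence_gap` and `expected_calibration_error`, the 10-bin default, and the toy data are all my own illustrative assumptions. It computes two common quantities: the gap between mean stated confidence and observed accuracy, and a standard binned Expected Calibration Error (ECE).

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus observed accuracy.

    Positive values mean the model claims more confidence than it earns.
    (Hypothetical helper for illustration, not taken from the benchmark.)
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    mean_conf = sum(confidences) / len(confidences)
    accuracy = sum(1 for ok in correct if ok) / len(correct)
    return mean_conf - accuracy


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted average of |accuracy - mean confidence| per bin.

    Confidences are assumed to be probabilities in [0, 1].
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in items) / len(items)
        acc = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(acc - avg_conf)
    return ece


# Toy data (made up): the model says 95% every time but is right half the time.
stated = [0.95, 0.95, 0.95, 0.95]
was_correct = [True, True, False, False]
gap = overconfidence_gap(stated, was_correct)
ece = expected_calibration_error(stated, was_correct)
print(f"overconfidence gap: {gap:.2f}, ECE: {ece:.2f}")
```

A perfectly calibrated model would score close to zero on both. In the toy case the model claims 95% and gets half the answers right, so both numbers come out large.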