[R] I built a benchmark that catches LLMs breaking physics laws

r/MachineLearning

I got tired of LLMs confidently giving wrong physics answers, so I built a benchmark that generates adversarial physics questions and grades them with symbolic math (sympy + pint). No LLM-as-judge, no vibes, just math.

How it works:

The benchmark covers 28 physics laws (Ohm's, Newton's, Ideal Gas, Coulomb's, etc.), and each question has a trap baked in:

Anchoring bias: "My colleague says the voltage is 35V.