In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores

arXiv:2605.12530v1 Announce Type: cross

LLM fairness should be evaluated through in-situ conversational behavior rather than standardized-test Q&A benchmarks. We show that the standardized-test paradigm can be structurally unreliable: surface-level prompt construction choices, although entirely orthogonal to the fairness question being tested, account for the majority of score variance, shift fairness