The Partial Testimony of Logs: Evaluation of Language Model Generation under Confounded Model Choice

arXiv:2605.01311v1 Announce Type: new Offline evaluation of language models from usage logs is biased when model choice is confounded: the same user-side factors that influence which model is used can also influence how its output is judged, so raw comparisons of logged scores mix self-selected populations rather than estimating a common quantity of interest. A small randomized experiment can break this bias by overriding model choice, but in practice such experiments are scarce and costly.
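
To make the confounding argument concrete, the following is a minimal toy sketch, not the paper's method. It assumes a hypothetical population of two user types ("expert" and "novice") whose shares, choice probabilities, and expected scores are invented purely for illustration. It compares the gap seen in raw logs, where users self-select a model, with the gap obtained when model choice is overridden for everyone.

```python
# Hypothetical toy population: every number here is an illustrative assumption.
# Each user type has a population share, a probability of choosing model A,
# and an expected judged score under each model.
USER_TYPES = {
    "expert": {"share": 0.5, "p_choose_a": 0.9, "score": {"A": 0.6, "B": 0.5}},
    "novice": {"share": 0.5, "p_choose_a": 0.1, "score": {"A": 0.9, "B": 0.8}},
}


def logged_mean(model):
    """Expected score among users who self-selected `model` (what raw logs show)."""
    num = 0.0
    den = 0.0
    for t in USER_TYPES.values():
        p_choose = t["p_choose_a"] if model == "A" else 1.0 - t["p_choose_a"]
        weight = t["share"] * p_choose  # mass of this type using `model`
        num += weight * t["score"][model]
        den += weight
    return num / den


def randomized_mean(model):
    """Expected score if every user were assigned `model` (one common population)."""
    return sum(t["share"] * t["score"][model] for t in USER_TYPES.values())


naive_gap = logged_mean("A") - logged_mean("B")
true_gap = randomized_mean("A") - randomized_mean("B")

print(f"naive logged gap (A - B): {naive_gap:+.2f}")
print(f"randomized gap   (A - B): {true_gap:+.2f}")
```

In this toy setup model A is better for both user types, yet the logged comparison favors model B, because experts (who judge more harshly here) mostly choose A while novices mostly choose B. The two gaps differ in sign only because the logged means average over different self-selected populations.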