Failure of contextual invariance in gender inference with large language models

arXiv:2603.23485v1 Announce Type: cross Standard evaluation practices assume that large language model (LLM) outputs are stable under contextually equivalent formulations of a task. Here, we test this assumption in the setting of gender inference. Using a controlled pronoun selection task, we